After working around a minor bug in the Introduction, Part I described how I lost a week discovering and working around the hydra bug in my KVM/VF-based virtual machine management prototype. Part II explained how I lost another week making the iPXE intelxvf driver more robust.
Everything was working, but it annoyed me that I hadn’t actually fixed the hydra bug, only worked around it. Since I may not be able to control which kernel gets booted, it makes sense to prevent hangs on the iPXE side, but I didn’t know exactly why the workaround fixed the problem, and that bothered me. I had cheated and it felt dirty. I couldn’t rest until I had rectified the problem. Read on for the real solution.
Months have passed since I last worked on this, so my first step is to reproduce the original issue. I revert my workaround and try to boot kernel 3.8. Same issue. I make sure to pull the most recent release of UEK3 just in case the bug has been fixed. No luck.
I need to know where the kernel is hung. My first attempt is to try to enable the sysrq key via the kernel command line so I can force it to stack trace. Unfortunately, no matter how much I mash print-screen, there is no response from the hung kernel. Perhaps I’m too early in the boot process?
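For the record, here is roughly what I was trying to do, with sysrq forced on from the kernel command line; injecting the key combination through the qemu monitor (rather than mashing Print Screen on the console, as I was doing) is my own embellishment:

    # Append to the guest kernel command line: keep the magic sysrq key
    # enabled from early boot onward
    sysrq_always_enabled=1

    # From the qemu monitor (e.g. qemu started with -monitor stdio),
    # inject Alt-SysRq-t to ask the hung guest for a task dump
    (qemu) sendkey alt-sysrq-t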
It looks like I’m going to have to get down and dirty with gdb to get a stack trace. I download the debuginfo packages for the guest kernel and start qemu stopped with a debug listener (the -s and -S options). I remotely connect gdb, load the debug-ready kernel image, and type c for continue.
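A minimal sketch of that setup, assuming the debuginfo package installs the uncompressed vmlinux under /usr/lib/debug (the usual location on Oracle Linux) and with the rest of the qemu options elided:

    # -s listens for gdb on tcp::1234, -S freezes the CPU at startup
    qemu-system-x86_64 -enable-kvm -s -S ...

    # In another terminal, point gdb at the debug-ready kernel image
    gdb /usr/lib/debug/lib/modules/<kernel-version>/vmlinux
    (gdb) target remote :1234
    (gdb) c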
All of the online help talks about debugging the kernel by booting it directly with qemu. When qemu starts my kernel isn’t loaded. It jumps through the BIOS, iPXE and grub before finally reaching the kernel code. I therefore assume that I’m going to have to figure out where the kernel is loaded in memory and tell gdb.
Once I’m sure the guest is hung, I hit Ctrl-C, and to my surprise I see source code. Upon further reflection this makes sense, as the early kernel boot is going to load its routines into the same place in memory. Or maybe gdb is just magic. Looking at the backtrace, it seems to be stuck in a while loop in kvm_steal_clock. OK, time to investigate how the kvm clock works.
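The interrogation itself is only a couple of gdb commands once the guest looks wedged:

    (gdb) c
    # ...wait until the guest console stops making progress, then hit Ctrl-C...
    (gdb) bt               # where is the guest stuck?
    (gdb) info registers   # sanity-check that we are really in 64-bit kernel code
    (gdb) list             # show the surrounding source, courtesy of the debuginfo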
It appears that the kvm clock communicates with the host by allocating some shared memory that the host side writes to and the guest reads from. After a quick check to see if any of the code in that function had been changed in UEK, I do notice that it is wrapped in an #ifndef CONFIG_DTRACE, but the function is definitely being called so that is a red herring.
Just to verify that the problem is at this location, I try to disable the kvmclock completely. Passing in cpu=host,-kvmclock on the qemu command line changes the hang location slightly but doesn’t seem to solve the problem. I reason that the memory used by the kvmclock is somehow conflicting with iPXE, and maybe removing it is just corrupting something else.
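On the qemu binaries I have handy the switch is spelled -cpu, so the invocation looks something like the following (newer qemu releases prefer kvmclock=off to the leading minus); the remaining options are elided:

    # Expose the host CPU model but hide the kvmclock paravirtual clock feature,
    # forcing the guest to fall back to another clocksource
    qemu-system-x86_64 -enable-kvm -cpu host,-kvmclock ...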
Reading the early memory allocation in the kernel source, it looks like it is allocating a large chunk that includes part of the reserved area (this was the value for NODE_DATA that was printed out way back in Part I). Despite that, the kvmclock code prints out the memory locations that it is using, and, while they are close to the iPXE reserved region, they don’t seem to be directly conflicting.
I decide to step through the kvm clock initialization to see if it can shed any more light. I restart the debugging session, type in b kvmclock_init followed by c, and wait for the breakpoint to hit. I reach the hang without hitting the breakpoint. What the? Is kvmclock_init getting skipped? No, the console shows the debug statement from the middle of the function, so it definitely ran.
I search the internet for qemu breakpoints and stumble across a page that mentions that hardware breakpoints may work better than software breakpoints. I recall having issues with software breakpoints when I was developing a bootloader using qemu long ago. Something about being in 16-bit mode confused the debugger. I try a hardware breakpoint instead. It stops this time but gives me the cryptic message Remote ‘g’ packet reply is too long.
All x86 processors start in 16-bit real mode for compatibility. When Intel introduced 32-bit processors, 32-bit instructions were hidden behind protected mode. Even modern 64-bit processors start in real mode and call specialized instructions to eventually enter a mode where 64-bit instructions are available. When you connect to qemu before the VM executes (using qemu -s -S), gdb detects that the processor is in real mode and configures itself appropriately.
Unfortunately, gdb doesn’t deal well with mode switching during a remote debugging session. Software breakpoints are lost, and it doesn’t decode packets from the remote side properly. It is possible that a more recent version of gdb and/or gdbserver would solve this problem, but a couple of workarounds are sufficient to get moving again: use hardware breakpoints instead of software ones, and disconnect and reconnect gdb after the guest has switched modes, as sketched below.
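In gdb terms the recipe looks something like this; the set architecture line is a commonly suggested remedy for the ‘g’ packet error, and whether it is strictly required may depend on the gdb version:

    # After the guest has switched into 64-bit mode, drop the stale connection
    (gdb) disconnect
    (gdb) set architecture i386:x86-64
    (gdb) target remote :1234

    # Use a hardware breakpoint; software breakpoints set before the mode
    # switch quietly stop working
    (gdb) hbreak kvmclock_init
    (gdb) c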
The packet reply issue is solved by a reconnect, and I’m able to dump memory through gdb and through a couple of qemu console commands. There definitely seems to be a value in the location that the kvmclock is using to communicate with the host, but it isn’t obvious what data is there and why it might be breaking.
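For the record, the dumping itself; the address below is a stand-in for wherever the kvmclock reported its shared area, not the real value from my boot:

    # From gdb: sixteen 64-bit words at the guest-virtual address (illustrative)
    (gdb) x/16xg 0xffff880001ffe000

    # From the qemu monitor: the same region by guest-physical address
    (qemu) xp /16xg 0x01ffe000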
I do another run with a newer kernel to see if I can spot any differences. I notice that the broken kernel refers to instantiating the “boot clock”, whereas the working kernel refers to the “primary clock”. Hmm, the code for instantiating the primary clock seems to be in a different location. Perhaps it is actually failing the initialization in the newer kernel but not hanging for some reason?
I spend far too much time trying to track down the difference in the primary clock. After a couple of hours of being frustrated in gdb, I finally discover that I’m looking at an older version of the code. In the actual working kernel code, the duplicate initialization of the primary clock has been removed and the name of the clock has been changed from “boot clock” to “primary clock”. So the new kernel is actually running the same code. What a red herring that was!
Next I go down the rabbit hole of attempting to understand the early memory allocation process. The memblock allocator seems to find the highest available address by looking at the end of the available memory reported by the kernel and then slicing off a chunk. As far as I can tell by reading the code, any reserved sections are only taken into account later when the page table is constructed. I start to think that there is a bug in the early boot process where allocations before the page table is initialized could conflict with BIOS-reserved memory.
There are a couple of problems with my theory:
It seems unlikely that a massive bug like this would have made it through 10+ major versions of the kernel without getting noticed.
If the bug is definitely in the kvmclock code, maybe the memblock allocation is ok after all.
I take another stab at verifying the bug is definitively in the kvmclock. I notice a couple of command-line parameters that seem to control the initialization of the clock from the guest side. Trying various combinations of these parameters, I discover that if I specify no-kvmclock and no-kvmclock-vsyscall then the guest actually boots. This looks like hard evidence that the problem is in kvmclock.
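These are guest kernel parameters rather than qemu options, so they belong on the kernel command line (via grub or the iPXE boot script); the kernel image and root device below are placeholders:

    # Disable the kvmclock clocksource and its vsyscall support in the guest
    linux /vmlinuz-3.8.x ro root=/dev/vda1 no-kvmclock no-kvmclock-vsyscall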
In the end I get lucky. Scouring the net for various function calls in kvmclock_init, I stumble across a very promising bug report. The details refer to a hang because kvmclock is reading uninitialized memory when there is more than one processor. As a test, I lower the number of processors to 1. The kernel boots! This bug report is definitely about our bug.
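Lowering the processor count is just qemu’s -smp option:

    # With a single vCPU there is no secondary processor to read the garbage
    # value, so the guest boots; the remaining options are elided
    qemu-system-x86_64 -enable-kvm -smp 1 ...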
The fix for the bug is fairly trivial. It was a brief regression introduced in 3.8 and fixed in 3.10. Other distros had backported it. I check the UEK source and it has not been backported. The last step is to verify that the fix actually solves our hang.
I’d prefer not to do a full kernel build, so instead, I fire up gdb and set a breakpoint where the offending memory is allocated. After the breakpoint, I manually call memset to clear the memory, and continue the boot. Eureka! I have a working 3.8 kernel with kvmclock enabled. I send a bug report to the UEK team, confident that the next release of the 3.8 kernel will be free from this tenacious little bug.
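My in-gdb verification looked roughly like the following; hv_clock and size are the names I expect in the 3.8-era kvmclock_init, and the exact breakpoint location depends on the kernel tree, so treat both as assumptions:

    # Stop just after kvmclock_init has carved out the shared memory area
    (gdb) hbreak arch/x86/kernel/kvmclock.c:<line-after-the-allocation>
    (gdb) c

    # Mimic the upstream fix by zeroing the area by hand, then resume the boot
    (gdb) call memset(hv_clock, 0, size)
    (gdb) c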
So it turns out that the kernel memblock allocation is just fine. It just so happens that it allocates some previously used memory for the kvmclock shared area. Due to a minor change in the implementation, the memory for the kvmclock for secondary processors is not initialized, so a secondary processor attempts to read a garbage value and loops forever.
The workaround in Part I was moving the kvmclock to memory that happened to be initialized to zero as a side-effect. This is good to know, because the workaround will fail if something in the boot process (BIOS, iPXE, or grub) happens to write data at the new location. In other words, my workaround mostly works by chance.
So what is the moral of the story? Search for existing bug reports earlier! I had dreams of finding an interesting problem in kernel memory allocation, but if I had just googled more directly about some of the lines of code around the actual hang, I would have discovered the key bug report much sooner.
Is a missed backport a fitting conclusion to three weeks’ worth of debugging? Perhaps not. But at least I will sleep easier knowing why my poor little kernel couldn’t boot. No bug left behind!