One Night in Bisect Makes a Hard Man Humble
The Introduction explained how I managed to get a vm booting over iSCSI using virtual functions. Everything seemed to be going well, but I was about to discover a serious problem when I tried a full linux OS. Read on for the discovery of the hydra bug.
I have a working boot of a busybox-based linux, but I need to try a full linux OS. I grab an Oracle Linux that I know to work with iSCSI on real hardware, create a new iSCSI LUN, and attempted to boot it.
Within a few seconds, I am excited to see grub popping up on my virtual console. With bated breath, I wait for the beauty of the login screen to greet me. I wait, and wait, and wait… Nothing but a flashing cursor. At first I suspect that the image has been corrupted, but a test booting from the target on real hardware works fine.
I suspect it is having some sort of problem passing off the iSCSI volume from iPXE to the kernel. I add various debug options to the kernel command line to see what might be going on early in the boot process. Still nothing.
In desperation, I try early_printk=console. Finally, some text! I see logs about memory allocations from the BIOS, some messages about ACPI starting up, then the same flashing cursor.
I try kernel command-line options for turning off ACPI and disabling other parts of the early boot process, but nothing seems to help. Could it be another bug in iPXE that is causing the issue? Back to git-bisect.
The first thing I must do is find a working commit. I switch back to a virtio NIC and verify that it hangs. Then I try the old iPXE build from the ipxe-roms package. Amazingly, it works.
A few hours of bisecting and testing later I have found the commit that causes the hang. I check out the commit before and run the build for the virtual function… failure. It turns out the breaking commit is from before driver support for my NIC had been added. Fortunately, the commit is about changing compression and I reason that I can revert it without too much trouble. I switch back to a recent commit add a revert of the broken one and produce a new build.
Elated, sure that I have a solution in hand, I boot… and see the same evil flashing cursor. s/elated/deflated/g. I must have missed something! I triple-check my build and my command-line options. I switch back to virtio. No dice. Reverting that commit did not fix the issue. There must be another more recent commit that causes the same hang. It’s getting late, but I am too annoyed to sleep. I check out the offending commit, add a revert right after it, and rebase all of the newer commits on top of it. Then I git-bisect again.
I quickly find a second commit that causes the issue and it is also old. Now I have two reverts and the cursor is still haunting me. It seems that each time I revert a commit, I find another to take its place. Apparently, this bug is like a hydra: each time I cut of a head, a new one takes its place.
What follows is an entire night of bisecting, rebasing, and testing. By the time the sun rises I have 5 different commits that all cause the same issue. I have a working build with a whole bunch of random reverts and rebases. This build is totally unmaintainable. To even have a chance to maintain it, I would have to stick all of the reverts at the top of the tree. Unfortunately, all of them were changed by more recent commits, so I would have to rewrite all of the reverts by hand and hope I didn’t break anything.
Also, alarm bells are going off. Something is very fishy. How could 5 totally unrelated commits cause the same bug? One of the commits just adds an extra parameter to a function call. How on earth could this break the kernel boot? I decided to produce the most minimal example of a working vs. a broken build to see what might be different.
I boot with early_printk enabled both with and without the above commit. I diff the output and discover that there is a very minor difference between the two runs. Early in the boot process, the kernel logs which memory ranges the bios reports free:
e820: BIOS-provided physical RAM map:
BIOS-e820: [mem 0x0000000000000000–0x000000000009c3ff] usable
BIOS-e820: [mem 0x00000000000f0000–0x00000000000fffff] reserved
BIOS-e820: [mem 0x0000000000100000–0x000000007fef6fff] usable
BIOS-e820: [mem 0x000000007ff95000–0x000000007ffdffff] usable
The vm is allocated 2G of memory. Note the small gap near the end of 2G that is reported usable. The boots that hang report a smaller amount of memory free. Testing various builds, I narrow it down to 109 (4K) pages. If there are more than 108 pages free, the kernel boots. If the gap is smaller, it hangs. This seems to imply that the kernel is writing over some important part of iPXE.
Brief Aside: The BIOS e820 Routine
Some memory in the early boot process is used by the BIOS. Everything would go haywire if the bootloader or kernel overwrote chunks of the BIOS with its own data, so the BIOS has an interface so that the kernel can ask it what memory it is using.
To communicate with the BIOS, it puts e820 in the AX register, calls INT 15 and the BIOS gives the kernel a map of used memory. The iPXE code replaces the bios interrupt handler with its own code that adds itself to the memory map. The result of this is that a larger build of iPXE will report more reserved memory in the memory map.
A differing amount of reserved memory explains how multiple patches can cause the same issue. Any change that makes the iPXE binary larger could shrink the gap below the required 109 pages. Unfortunately, it doesn’t really tell us why the size of this gap is important. It appears that iPXE is properly reporting reserved memory, so the kernel shouldn’t be overwriting it. Perhaps the kernel is not respecting the reserved space? In, fact, some logs a bit later in the boot seem to indicate that the kernel is using space that it shouldn’t:
No NUMA configuration found
Faking a node at [mem 0x0000000000000000–0x000000007ffdffff]
Initmem setup node 0 [mem 0x00000000–0x7ffdffff]
NODE_DATA [mem 0x7ffb9000–0x7ffdffff]
The beginning of the node memory (0x7ffb9000) overlaps with iPXE (which ends at 0x7ff95000). Have we found a kernel bug? My Oracle linux image has an older, modified kernel (UEK, a 3.8 kernel with dtrace and other features added), so my first step is to see if the issue still exists in a newer kernel. Trying the standard Oracle linux (3.10 based) kernel prints the same allocation information at boot, but, lo-and-behold, it passes the hang and makes it all the way to the login prompt.
I try a newer UEK (based on 4.1) and it is fine. I find a fedora 3.8 kernel and it also boots. It appears that whatever is going wrong is specific to this old UEK. At this point, I have a choice. I can start trying to debug this particular kernel, or I can attempt to workaround the bug on the iPXE side. What if i reserve those 108 (or less, the current build only leaves 75) pages in memory after iPXE? It’s just 400K of ram. Today, that is a drop in the bucket.
The modification to iPXE is trivial:
@@ -37,3 +37,3 @@ FILE_LICENCE ( GPL2_OR_LATER_OR_UBDL );
/** Alignment for hidden memory regions */
-#define ALIGN_HIDDEN 4096 /* 4kB page alignment should be enough */
+#define ALIGN_HIDDEN 0x80000
One final build, and, voila, the kernel boots just fine. I still want to find the bug, but I have more pressing matters, so that will have to wait (until Part III). You see, the vm is working totally fine, but if I reboot it, iPXE fails to iSCSI boot. My work is not done. The story is continues in Part II.