A Gentle Start
A few months ago I was prototyping some KVM-based virtual machine management code. The goal was to boot a linux vm over iSCSI using pci-passthrough of virtual functions on an Intel NIC. I expected it to be straightforward. What followed was weeks of some of the hairiest debugging I have ever faced, leading to a bug I have named “The Hydra Bug”, because, like a hydra, each head I removed caused a new one to sprout.
The vm boot uses a piece of open source software called iPXE. To understand my basic workflow, check out Part I and Part II of my tutorial on KVM + iSCSI. The first issue I faced was during the transition to the virtual function as described in KVM + iSCSI Part II. Read on for the story.
After verifying that I can boot my image properly using a virtio NIC, I switch over to the prepared virtual function. I see iPXE starting in my VNC console, but it quickly prints an error. I check the the iPXE docs and turn on debugging for the intel driver (
DEBUG=intel,intelxvf), but none of the messages look suspicious. Maybe, I reason, if I compare the debug output to similar output from the virtio driver, I can see what it is doing differently.
I build iPXE with the virtio model number and debug turned on (
make bin/1af41000.rom DEBUG=virtio-net) and specify that qemu should use it by passing a romfile parameter. Strangely, i see the same error in my console.
What the? I go back and forth a few times trying to figure out what I might have done wrong and on the second try with the virtio NIC, it boots successfully. Oddly, though, there is no debug output in the console. At this point, I’m convinced I’m going insane. Through sheer dumb luck, I notice that I had forgotten to include the romfile parameter in the command line. My debug build was not being loaded, and it was automatically falling back to the older virtio iPXE option rom that came with my qemu package. Aha!
So an old version of iPXE works, and the new one is broken. At this point, the easy solution would be to go back to the old working commit. Unfortunately, I can’t just use the old version because virtual function support was added to iPXE relatively recently. Besides, I had built iPXE at some point in the past year and I was pretty sure it was working then. I deduce that it must have been a fairly recent commit that broke things. If only i had an easy way to determine what broke things. Git-bisect to the rescue!
Brief Aside: Git Bisect
Git-bisect is a fantastic tool for finding the point in history where a bug was introduced. It uses a binary search to find the offending commit in the most efficient way possible. Bisecting is much easier if every commit in the tree is build-able, which is an excellent reason to keep your git history clean.
Although the bisect process can be scripted, it is fairly easy to do manually. With the broken commit checked out and the sha of an older, working commit, you begin the bisect:
git bisect start
git bisect bad # current HEAD doesn’t work
git bisect good $SHA # SHA is the working commit
It will checkout the commit for you and tell you how many steps remain. You can then build the software and test it. If it works you type
git bisect good. Otherwise you type
git bisect bad. It will then automatically check out a new commit and you repeat the process. Eventually you will zero in on the exact commit that introduced the bug.
To speed up the testing cycle, I write a script to build iPXE and boot the vm. Even though there had been thousands of commits between the working build and the broken one, bisecting quickly reveals the commit that broke things had only been merged two weeks before.
The fix is anything but obvious (the bug is in some incomprehensible low level assembly for BIOS disk reads), so I report it upstream and resolve to use iPXE from the commit right before it broke for my further testing. With my new iPXE build in hand, I am able to successfully boot my cirros image.
Now that the build is working, I am primed to discover the hydra bug itself. I discover the bug and a workaround in Part I. Part II detours into a related bug. The bug is fully analyzed and fixed in Part III.