The Hydra Bug: Part II

Written by vishvananda | Published 2017/05/21
Tech Story Tags: linux | kvm | debugging | ipxe | iscsi


iPXE — 60% of the Time, it Works Every Time

After working around a minor bug in the Introduction, Part I described how I lost a week discovering and working around the hydra bug in my KVM/vf-based virtual machine management prototype.

I had a build of iPXE that worked… the first time. I stretched the truth slightly in Part I: the failure to reboot wasn’t a total surprise. I had run into the bug a few times during prototyping, but I would fix it by switching to a different vf or rebooting the host machine. Read on for Part II of the adventure.

On the second boot of a vm, iPXE fails to communicate via iSCSI. Turning on debugging for the intelxvf driver reveals that no packets are being received. Once the vf is in this state, it remains broken until the host is rebooted. Clearly some persistent state in the virtual function is not being cleared.

Brief Aside: Intel NIC Configuration

Configuration of the Intel 82599 NIC is performed by writing to specific memory-mapped locations. Individual locations control different features and are referred to in the datasheet as “registers”. For example, there is a register that tells the NIC where to write received packet data for the driver to read.

Some registers can only be written to via the physical function (e.g. on the hypervisor host), but other registers can be written via the virtual function. This is how the virtual function driver sets up transmit and receive queues.

If a bug persists across multiple instantiations of a kernel/driver, it must be because some register value has changed. The driver in iPXE sets register values itself so it can communicate, but the linux driver later reconfigures the registers once the kernel has taken over.

I read through the Intel datasheet trying to understand what might be going wrong. I find the iPXE code for setting up the registers to see if it is failing to reset something. That’s odd, the code is quite clearly sending a message to the physical device asking for a reset. Why wouldn’t it be resetting?

The driver code for the physical function (that’s the driver for the physical device on the hypervisor side), I am surprised to discover, doesn’t reset everything when it receives a vf reset message. While some values (like the mac address) are reset, others are not (like the send and receive queues). This gives me a hypothesis: the vf driver in linux must be setting up some registers, and when iPXE runs the second time it is using incorrect values.

I peruse the code, looking for a host-side workaround to perform a complete reset of the vf. Eventually I find one buried in ethtool: I can reset the queues by using ethtool to change the ring parameters of the parent device (ethtool -G). This is fantastic for testing as it means I don’t have to reboot, but, unfortunately, it resets all virtual functions at once. That means resetting the NIC for one vm when it is created will reset all of the other vfs on the same box. That’s not very friendly! So I’m going to have to modify the iPXE side to reset properly.

Initially, I try comparing the linux vf code to the iPXE vf code to look for differences, but they are stylistically different enough to make it difficult. I wonder if there is some way for me to view the current setting of the vf registers. Fortunately, Intel provides source for ethregs, a tool that can dump the current register values.

My goal is to make the registers of the NIC, after iPXE has reset it, identical in the before-linux and after-linux scenarios. With ethregs and diff, I find a difference in the configuration of the second RX queue. The register for the RX queue contains a DMA location: this is where the NIC will write the contents of packets to be read by the driver. On the first boot the second RX queue’s register is set to zero, meaning the queue is disabled. On the second boot, there is a value there.

Diving into the code, I see that iPXE has only configured a single RX queue. A picture starts to become clear. The linux driver is configuring multiple RX queues and telling the hardware where its receive rings are located in memory. On reboot, iPXE sets the first queue to its receive ring, but the second RX queue is left with a memory location that is no longer in use. The NIC is happily delivering some of the packets to the second queue, where nothing reads them!

I prefer to keep the iPXE code simple; it doesn’t need multiple receive queues. I therefore add a bit of code to reset all of the RX and TX queues before the existing configuration runs. The first boot with the new code still works perfectly, but the second still fails. Apparently there are more registers that need to be reset.

Through meticulous diffing, I discover a few more differences. The main offending register seems to be one called INTELXVF_PSRTYPE, which the datasheet tells me configures how the packet will be split between header and data. The NIC can do some of the header processing, which linux takes advantage of. This advanced setting just confuses iPXE, which is expecting the whole packet to show up in its queue.

There are also some flags that get changed by linux in the SRRCTL register. These don’t seem likely to be causing problems, but I reset them anyway. With my new changes, the registers in each case are as close as I can get them: there are minor differences in allocated memory locations or counters, but the stable values are identical.

All that remains is to see if the second boot actually works. I switch to my virtual console to check the output from iPXE. For a moment I think it has stalled again, but it makes it through and I’m greeted by grub. Success!

I run through a third and fourth boot for good measure. Everything is smooth. It appears my prototype is complete. I compile a patch for an upstream proposal. I move on to other things. My prototype turns into a system that is almost ready for production. Before we ship, it might be nice to fix that kernel bug that I worked around in Part I. That adventure is detailed in Part III.


Published by HackerNoon on 2017/05/21