Unicorn Mode as demonstrated in my previous article is not overly useful on the surface. It has lots of limitations that make it clumsy and slow to employ against most real-world situations. For example, what if the binary you want to emulate calls an imported library function that is likely to call into the kernel such as malloc() or printf()? What if the code you want to fuzz is highly stateful, and requires lots of memory regions (heap allocations, stack pointers, global variables, etc.) that aren’t known until run-time? In fact, just about the only straightforward use of it that I’ve found is when working with flat embedded run-time system memory snapshots recovered from firmware by a debugger.
This article introduces some new tools and techniques that my co-worker Parker Wiksell and I developed in order to apply afl-unicorn to Windows, Linux, Android, and iOS applications.
Afl-unicorn bridges the gap between the thoroughness of fully manual research (i.e. reading disassembly/source) and the unmatched ease-of-use of AFL. With a little bit of reverse engineering and setup time, afl-unicorn lets you leverage the power of AFL to rapidly discover vulnerabilities in parts of the code that you know are suspicious and whose purpose you have a basic grasp of.
Maybe you’re asking yourself, “If I need to do a bunch of reverse engineering, why would I bother taking the time to get afl-unicorn up and running?” In my case it was an easy decision: I consider myself a pretty decent reverse engineer, but I don’t trust my ability to spot all the vulnerabilities in the code I’m reversing. I’ve found myself missing out-of-bounds memory accesses, integer overflows, etc. in the past, and so I’d rather rely on AFL’s mutation engine to do the bug-hunting for me. Also, if you’re going down the path of manual analysis you are going to do the RE legwork anyway, so taking a day or two to spin up afl-unicorn and then letting it run in the background while you keep sifting through code provides bonus, low-cost coverage.
The General Workflow
While the original blog post described the basic mechanics behind how afl-unicorn works and provided a toy example, this post aims to show a more real-world way to use it against an application that runs on an operating system (such as Windows, Linux, Android, or iOS). In practice, you’ll want to understand what afl-unicorn does and adapt it to your specific problem so that you either avoid false positives and negatives or know how to identify them.
The first task is to reverse engineer some basic knowledge about the code you want to fuzz. This includes identifying a good starting and ending point and how that code receives the input that you are going to mutate.
Let’s say that you identify a top-level parsing function for a network packet. Does the function take the packet off the wire as a parameter? How is that passed into the function? Most likely, this will be either through a globally-allocated buffer, a pointer on the stack, or a pointer in a register.
You’ll also want to figure out (as best as you can) what constraints there are on the input. For example, what is the maximum size? Are there any invalid characters? Make sure you think about this in the context of the selected starting address, since these constraints can change over time as the code filters invalid inputs out and allocates buffers of various sizes throughout input handling.
Once all that research is done you’ll want to capture a snapshot of the process at the starting address while handling a valid input. We’ve accomplished this by creating a series of ‘Unicorn Context Dumper’ scripts which, when run while sitting at a debugger breakpoint at the start address, save the entire process memory, register values, and architecture information to a ‘Context Directory’ on disk.
Now you need to write a Unicorn script which loads the process context that you dumped out, loads input to be fuzzed with data read from a file on disk, and emulates from the start to the ending address. If any errors or crashes are detected during emulation this script must forcibly crash itself so that AFL will be able to detect it. We’ve created a set of helper utilities we call the ‘Unicorn Loader’ module that makes most of these tasks easy. The ‘Unicorn Loader’ also includes a full stand-in heap manager which is useful for preventing emulation errors which happen when emulating typical OS applications (more on this later).
Once your Unicorn test harness script can successfully emulate from the start to the end address (and does all the other stuff mentioned above), it’s time to create some valid, non-crashing sample inputs and run it under afl-unicorn as described in the first blog post. With any luck, you’ll see paths being discovered and hopefully some crashes!
A Concrete Example: CGC’s FSK_Messaging_Service
Description of the Example Target Application
Trail of Bits recently released cb-multios, which contains the challenges from DARPA’s Cyber Grand Challenge with additional support libraries that make them easy to compile and run on Linux. In this example I’ll demonstrate how afl-unicorn can be used to fuzz the parsing function of one of the challenges that was specifically designed to be difficult to fuzz, FSK_Messaging_Service:
[…] a service that implements a packet radio receiver with included FSK demodulation front-end, packet decoding, processing, and finally parsing it into a simple messenging service.
The FSK_Messaging_Service challenge was specifically designed to be challenging to fuzz. While the underlying vulnerabilities are fairly straightforward, the bugs exist after extensive parsing and demodulation of simulated analog RF input. In addition, the data itself has a simple 16-bit checksum appended that must be verified before full parsing is performed. From the description of the challenge:
This [challenge binary] presents a number of challenges to a computer reasoning system. The difficulty lies in the transformation of the input set into the processed data after the RF front-end. Due to its very nature fuzzing will be ineffective as RF receivers are naturally subjected to noise and are particularly well suited to identifying signals in the presence of noise[…] This [challenge] is therefore subjectively considered to be hard and designed to test beyond state of the art input reasoning capabilities and solvers.
Here’s a diagram showing the overall logic and data flow of the FSK_Messaging_Service application:
Finding What We Need to Emulate and Fuzz the Target Code
OK. So we can’t fuzz the front-door interface, but with a little bit of analysis of the code (or disassembly if we didn’t have the source) it’s pretty easy to find the function that intuition tells us is most likely to have bugs: cgc_receive_packet() found in packet.c. This function is fairly simple, and does the following:
- Verify the packet buffer is not null and its length is greater than 0
- Validate the packet contents by computing and comparing the 16-bit CRC
- Loop over the packet types, and if one matches call cgc_add_new_packet()
- If a valid packet type is found, cgc_add_new_packet() instantiates a tSinglePacketData structure and copies information from the packet
Here’s a slightly doctored up collection of snippets from the source code showing the relevant parts:
Of course in reality you are very likely to not have the source code available to you. Instead, you’ll have to use traditional reverse engineering methods (static and dynamic analysis) to learn all of the necessary information about the target application.
So now we have our fuzzing target and knowledge of how the input is given:
- We want to fuzz from the cgc_receive_packet() function
- Input is passed into the function in the form of 3 parameters: a pointer to the packet data (uint8_t *pData), a corresponding length (uint8_t dataLen), and a checksum of the data (uint16_t packetCRC)
We also know a simple constraint on the input:
- The maximum data length is 255 bytes since dataLen is an 8-bit value
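As a rough illustration of how a test harness might respect this constraint before handing data to the emulated function, the sketch below trims a mutated input to what an 8-bit length field can describe. The helper name `prepare_packet` and the choice to truncate rather than reject oversized inputs are assumptions made for illustration; they are not taken from the challenge source.

```python
MAX_DATA_LEN = 0xFF  # largest value an 8-bit dataLen can hold


def prepare_packet(raw):
    """Return (payload, data_len) as the target function would receive them.

    Anything beyond MAX_DATA_LEN bytes is dropped, because the length
    parameter could not describe it anyway.
    """
    payload = bytes(raw[:MAX_DATA_LEN])
    return payload, len(payload)
```

Truncating keeps every input AFL generates usable; rejecting oversized inputs outright would be an equally reasonable choice.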
Dumping a Valid Running Process Context
Now we’ll want to get a snapshot of the entire process memory as this function is being called to make emulation as simple as possible. While this may seem like a heavy-handed approach compared to simpler methods (such as the in-place emulation offered by ripr or uEmu), it solves a ton of problems. For example, the global list cgc_g_packetHandlers is populated at run-time, so unless we have the run-time state of its memory location, the for-loop which iterates over the handlers in cgc_receive_packet() would fail during emulation.
Dumping the entire process memory state and register context is done using a ‘Unicorn Context Dumper’ script. We’ve created several different versions supporting different debuggers including IDA Pro (prior to version 7 for now), LLDB, and GDB with GEF. At the moment only the IDA version is available, but the rest (and any others that get created for other debuggers) will be pushed to GitHub as soon as they are ready. Simply attach IDA Pro’s debugger to a running FSK_Messaging_Service process, hit a breakpoint at the fuzzing start address, and run the script through IDA (File->Script File…). Note that I’ve only tested this with IDA’s built-in remote debug server. Attaching to other debuggers may present the memory segments differently, which would likely lead to errors.
I’ve chosen to set my starting address just before the call to cgc_receive_packet() at a spot where the parameters are conveniently all in registers instead of on the stack. This makes it a little easier to modify them in my Unicorn emulation script.
Once the script completes it generates a ‘Unicorn Context’ directory in the same folder as the IDA database (.idb). This directory contains two things:
- _index.json: A JSON-formatted file containing metadata about all memory segments in the process, register state, and architecture information
- Lots of gzip-compressed binary files containing the contents of each individual memory segment in the process
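To make the layout of the context directory more concrete, here is a minimal sketch of how such a directory could be read back using only the Python standard library. The key names inside `_index.json` (`regs`, `segments`, `start`, `end`, `content_file`) are assumptions for illustration; check the dumper script and unicorn_loader.py for the real schema.

```python
import gzip
import json
import os


def load_context(context_dir):
    """Read the index file and decompress the contents of each segment.

    Assumed (hypothetical) index layout:
      {"regs": {...},
       "segments": [{"start": int, "end": int, "content_file": str or null}]}
    """
    with open(os.path.join(context_dir, "_index.json")) as f:
        index = json.load(f)

    segments = []
    for seg in index["segments"]:
        content = None
        if seg.get("content_file"):
            # Each dumped segment lives in its own gzip-compressed file.
            path = os.path.join(context_dir, seg["content_file"])
            with gzip.open(path, "rb") as f:
                content = f.read()
        segments.append((seg["start"], seg["end"], content))
    return index["regs"], segments
```

A loader would then map each `(start, end)` range into the emulator, write the decompressed bytes into it, and set the registers from `regs`.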
Creating a Fuzzable Unicorn Test Harness
Now that we have a starting context to begin emulation from, we write a Unicorn script which loads the context (map all memory regions, load content into them, and set register contents), hooks anything that will break emulation or will impede fuzzing (malloc(), free(), checksum verification, etc.), inlays a new packet into the appropriate places, and emulates the code from start to finish. I’ve created a bare-bones template as an example test harness to start from.
Fast-forwarding a bit, shown below is the complete Unicorn script which can emulate the FSK_Messaging_Service application’s application-layer packet parsing starting from an initial state loaded from the context directory generated by the Unicorn context dumper. This script relies heavily on functionality imported from the unicorn_loader.py module provided with afl-unicorn. We’ll go over some of the more interesting bits below, but for the most part this follows the basic steps discussed in my previous blog post.
This script has a few unique parts in it that make emulation and fuzzing possible. Each is described in detail below:
Instantiation of Unicorn Engine Instance from Dumped Context: The unicorn_loader.py module provides a new AflUnicornEngine class which derives from the normal UnicornEngine. The constructor takes 3 arguments: path to the context directory, a flag to enable tracing output on STDOUT, and a flag to enable debug output while loading the context to STDOUT.
The AflUnicornEngine class also provides a few additional APIs that are useful in fuzzing test harnesses:
- dump_regs(): Dumps current register contents to STDOUT
- force_crash(e): Forces a crash of the test harness by issuing a signal (SIGILL, SIGSEGV, SIGABRT, etc.). This lets AFL detect that a crash occurred and log appropriately. You must call this if a crashing condition (such as emu_start() throwing an exception) occurs!
Because this class derives from the base UnicornEngine class you can still use all the normal calls, such as emu_start(), reg_read(), and mem_write(). To see all of the APIs available on the AflUnicornEngine class read through the unicorn_loader module’s source code.
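The idea behind forcing a crash can be sketched in a few lines: AFL treats a target as crashed when it is terminated by a signal, so the harness picks a signal that matches the emulation failure and sends it to itself. The error category strings below are hypothetical placeholders; a real harness would inspect the error code carried by the exception that emu_start() raised, and the module’s own force_crash() may differ in detail.

```python
import os
import signal

# Hypothetical error categories mapped to the signal a native crash of
# that kind would typically produce.
SIGNAL_FOR_ERROR = {
    "invalid_memory_access": signal.SIGSEGV,
    "invalid_instruction": signal.SIGILL,
}


def pick_signal(error_kind):
    """Choose a signal for the failure, falling back to SIGABRT."""
    return SIGNAL_FOR_ERROR.get(error_kind, signal.SIGABRT)


def force_crash(error_kind):
    """Terminate this process with a signal so AFL records a crash."""
    os.kill(os.getpid(), pick_signal(error_kind))
```

Simply printing an error and exiting with a non-zero status is not enough, because AFL would not count that as a crash.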
Hooking all heap allocations (malloc()): Calling malloc() during emulation can cause all sorts of problems. It’s possible that the allocator will need to ask the kernel for more memory, but during emulation there is no such thing as the kernel…so that would result in a crash. In order to prevent this, the Unicorn script hooks any call to malloc(), and instead calls a Unicorn-based implementation that is provided with afl-unicorn in the unicorn_loader.py module. The snippet below shows the code used to do this for the FSK_Messaging_Service binary, which is a 32-bit Linux binary.
In line 45 the number of bytes is retrieved from the stack. Line 46 calls the internal, Unicorn-based implementation. Line 47 puts the return value (the address of the buffer that was allocated) into EAX, and lines 48 and 49 manually perform a ‘return’ by setting EIP to the return address and then popping the return address off of the stack. All of this is in accordance with typical x86 calling conventions. When adapting this approach to your own binary, make sure that you follow the calling conventions for your given operating system and architecture!
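The same sequence of steps can be sketched in isolation. To keep the example runnable without Unicorn installed, it uses a small stand-in class (`FakeUc`) that mimics the `reg_read()`, `reg_write()`, and `mem_read()` methods of a Unicorn engine but takes register names as strings, plus a hypothetical `BumpHeap` allocator in place of the loader’s heap manager. It assumes the hook fires on the first instruction of malloc(), when ESP still points at the return address.

```python
import struct


class FakeUc:
    """Stand-in with Unicorn-style accessors so the sketch runs on its own."""

    def __init__(self, regs, stack_base, stack):
        self.regs = dict(regs)
        self.stack_base = stack_base
        self.stack = bytearray(stack)

    def reg_read(self, reg):
        return self.regs[reg]

    def reg_write(self, reg, value):
        self.regs[reg] = value & 0xFFFFFFFF

    def mem_read(self, addr, size):
        off = addr - self.stack_base
        return bytes(self.stack[off:off + size])


class BumpHeap:
    """Hypothetical allocator standing in for the loader's heap manager."""

    def __init__(self, base):
        self.next_free = base

    def malloc(self, size):
        addr = self.next_free
        self.next_free += (max(size, 1) + 0xF) & ~0xF  # keep 16-byte alignment
        return addr


def read_u32(uc, addr):
    """Read a little-endian 32-bit value from emulated memory."""
    return struct.unpack("<I", uc.mem_read(addr, 4))[0]


def hook_malloc(uc, heap):
    """Service a 32-bit x86 malloc(size) call and return to the caller."""
    esp = uc.reg_read("esp")
    size = read_u32(uc, esp + 4)             # first argument, above the return address
    uc.reg_write("eax", heap.malloc(size))   # return value goes in EAX
    uc.reg_write("eip", read_u32(uc, esp))   # resume at the return address
    uc.reg_write("esp", esp + 4)             # pop the return address
```

With a stack holding a return address of 0x08048123 followed by a size of 64, calling `hook_malloc()` leaves the allocated address in EAX, EIP at 0x08048123, and ESP advanced by 4, which is the state the caller would see after a real `ret`.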
Another major benefit of handling memory allocation ourselves is that we can implement our own rudimentary guard pages. Basically, all allocated buffers are surrounded by ‘guard pages’ which have no read or write permissions. Any access outside of the bounds of the returned buffer (AKA a heap overflow or underflow) will crash immediately with a memory access violation.
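The guard-page layout can be modeled with plain arithmetic. The toy class below is not the UnicornSimpleHeap implementation; it is a sketch, under the assumption of 4 KB pages, of how each allocation can be placed between two inaccessible pages and how an access can be checked against the bounds of the buffer. Note that it checks the exact buffer bounds, which is stricter than what page permissions alone would catch.

```python
PAGE_SIZE = 0x1000  # assumed page granularity


def align_up(value, alignment=PAGE_SIZE):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) & ~(alignment - 1)


class GuardedHeap:
    """Toy model: every allocation sits between two inaccessible pages."""

    def __init__(self, base=0x10000000):
        self.next_region = base
        self.chunks = []  # (data_start, data_end) for each live buffer

    def malloc(self, size):
        data_start = self.next_region + PAGE_SIZE       # skip the leading guard page
        mapped = align_up(max(size, 1))                 # usable pages for the buffer
        self.next_region = data_start + mapped + PAGE_SIZE  # trailing guard page
        self.chunks.append((data_start, data_start + size))
        return data_start

    def access_ok(self, addr, length):
        """True if [addr, addr + length) lies fully inside one buffer."""
        return any(start <= addr and addr + length <= end
                   for start, end in self.chunks)
```

For a 48-byte buffer, `access_ok(buf, 48)` holds while `access_ok(buf, 49)` and `access_ok(buf - 1, 1)` do not, which is the overflow and underflow behavior described above.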
Note that the UnicornSimpleHeap class in the unicorn_loader.py module provides free(), calloc(), and realloc() functionality as well, but for simplicity I’ve chosen to only hook malloc() in this example. For emulating larger, longer-running, and more complex code you will probably want or need to hook all of the heap-related functions.
Skipping unnecessary, hard to emulate functions: There are many other things that will obviously cause issues. printf(), for instance, will eventually call into the kernel (typically through a write() system call) in order to output the text. You’ll want to analyze the code that you’re trying to emulate and work hard to identify anything that you think is likely to break emulation. In this example case, I’ve determined that free(), printf(), and cgc_transmit() will cause emulation to fail for various reasons, and that I can skip them without any major consequences to fuzzing results. All of these functions are skipped by forcing an immediate return. This is accomplished the same way as the final part of the malloc() hook described above: Manually set EIP to the return address stored on the stack, then pop the return address off the stack by adding 4 to ESP. Remember that this exact process is specific to x86, so adjust as necessary for your target architecture.
Bypassing the checksum validation: Each received packet is accompanied by a 16-bit CRC that must be validated before the packet is parsed (refer to lines 27–31 from the source snippet earlier in the article). This alone presents a major challenge for traditional fuzzing, as any attempt to blindly modify packets will result in a failed CRC check and almost no code coverage. This type of problem is well known, but traditionally it requires either patching the target binary or developing code to correctly generate a valid checksum for each input.
Afl-unicorn makes bypassing this fairly trivial. For this example the checksum validation was very easy to identify in IDA:
We simply hook the call to cgc_simple_checksum16(), and anytime execution gets there EIP is manually set to the ‘CRC check passes’ path:
This doesn’t prevent us from having to figure out how to calculate the CRC later in order to develop a full working exploit, but it lets us push that work down the line and instead focus on finding vulnerabilities first.
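Both kinds of redirection used in this harness (forcing an immediate return from a skipped function, and jumping straight to the ‘CRC check passes’ path) boil down to a decision made at a known address. The sketch below captures that decision as a pure function. All addresses are hypothetical placeholders, and it assumes the hook fires on the call instruction itself for the checksum and on the first instruction of each skipped function.

```python
# Hypothetical addresses; the real ones come from the disassembly.
SKIP_FUNCTIONS = {0x08048A10, 0x08048B40, 0x08048C00}  # e.g. free, printf, cgc_transmit
CRC_CALL_ADDR = 0x0804A5B0   # the call to the checksum function
CRC_PASS_ADDR = 0x0804A5C7   # first instruction of the "CRC check passes" path


def redirect_for(address, return_address):
    """Decide how to redirect execution at `address`.

    Returns (new_eip, esp_adjust), or None if execution should continue
    normally. `return_address` is the 32-bit value currently at [ESP].
    """
    if address == CRC_CALL_ADDR:
        # The call has not executed yet, so the stack is left untouched.
        return CRC_PASS_ADDR, 0
    if address in SKIP_FUNCTIONS:
        # Behave like an immediate `ret`: jump back and pop the return address.
        return return_address, 4
    return None
```

Inside a Unicorn code hook, the returned pair would be applied by writing the new value to EIP and adding the adjustment to ESP.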
Emulating one instruction before loading mutated input: Here’s the weirdest part, and it’s really just an artifact of how I’ve instrumented AFL into Unicorn because I haven’t come up with a solution to the real internal problem yet: In order to make sure that AFL’s forkserver starts up at the right time, you MUST emulate at least 1 instruction before you load the mutated input from disk. If you don’t, then every fork that AFL creates will execute with the same input. In the example script this is done between lines 82 and 87:
So basically you just need this block of code in your test harness somewhere before you load the mutated input. There is one nuance, though: You need to ask yourself if re-executing the first instruction has any negative consequences. In this example the first instruction executed is a harmless ‘mov [esp],ecx’, so re-executing it doesn’t have any negative consequences. If you don’t want to or can’t afford to re-execute the first instruction, simply adjust the starting address appropriately the second time you start emulation (uc.emu_start()).
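The required ordering can be sketched as follows. `RecordingUc` is a stand-in that only records calls so the example runs without Unicorn; its `emu_start(begin, until, timeout=0, count=0)` and `mem_write(address, data)` methods mirror the Unicorn methods of the same names. The function name `run_once` and the `buffer_addr` parameter are hypothetical, and the error handling that would force a crash is left out for brevity.

```python
class RecordingUc:
    """Stand-in that records calls in order instead of emulating."""

    def __init__(self):
        self.calls = []

    def emu_start(self, begin, until, timeout=0, count=0):
        self.calls.append(("emu_start", begin, until, count))

    def mem_write(self, address, data):
        self.calls.append(("mem_write", address, bytes(data)))


def run_once(uc, start, end, input_path, buffer_addr):
    # 1. Emulate a single instruction so the forkserver starts up
    #    before any input has been read.
    uc.emu_start(start, 0, 0, count=1)

    # 2. Only now read the mutated input that AFL wrote to disk.
    with open(input_path, "rb") as f:
        data = f.read()

    # 3. Inlay the input into emulated memory (hypothetical location).
    uc.mem_write(buffer_addr, data)

    # 4. Emulate the full range; the first instruction runs a second time.
    uc.emu_start(start, end)
```

If re-executing the first instruction is not acceptable, the second `emu_start()` call would begin at the address of the second instruction instead of `start`.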
Fuzzing the Emulated Binary with afl-unicorn
With the Unicorn harness complete the only thing left to do is run it under afl-unicorn and hope that it finds some crashes. For a detailed breakdown of how to run afl-unicorn make sure you read my previous blog post, but for this specific instance we just run the typical afl-unicorn command line:
afl-fuzz -U -m none -i /path/to/inputs/ -o /path/to/results/ -- python fsk_message_service_test_harness.py /path/to/context_dir/ @@
Sure enough, it found the vulnerability in the cgc_receive_packet() function described in the challenge’s README:
Upon reception of a packet that exceeds the maximum packet size of 64-bytes improper length checking is done for the memcpy to the newly allocated packet structure. This allows a memory overwrite to occur on the heap. This data structure has a function pointer to the packet handler that can be overwritten and once the service executes this function pointer there is an opportunity for control flow execution by overwriting this function pointer.
It’s obvious from dumping the crashing input file that the packet is too large (>64 bytes) for the packetData buffer allocated in the tSinglePacketData structure:
We can then verify this by running the Unicorn script with the crashing input:
The next step would be to figure out how to send this crashing input into the actual (non-emulated) application and prove that it is a true, working crash and that it does not stem from an emulation error.
Debugging Emulation-based Fuzzing Issues
Some common problems that I’ve encountered include:
- No paths are discovered: Make sure you emulate at least one instruction before loading the mutated input. If you don’t, every fork gets the same unchanged input. Run your test harness on its own outside of afl-unicorn and make sure that it runs from the start to end address without any problems. If that doesn’t fix it, make sure that the mutated inputs are being written into the emulated memory and register context correctly.
- Way too many crashes are found: Either you’ve stumbled onto some really buggy code (jackpot!), or there are emulation issues. Follow the emulation debug trace output and look for things that are breaking emulation such as segment register-based dereferences, system calls into the kernel, or dynamic module loading.
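When checking a harness outside of afl-unicorn, it helps to run it once per sample input and look at how each run ended. The sketch below is one way to do that with the standard library; the function names are hypothetical, and the interpretation of negative return codes as signals applies to POSIX systems.

```python
import signal
import subprocess


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable verdict."""
    if returncode < 0:
        # On POSIX, a negative code means the process died from that signal.
        return "crash (%s)" % signal.Signals(-returncode).name
    if returncode == 0:
        return "clean exit"
    return "error exit (%d)" % returncode


def check_harness(harness_cmd, sample_paths):
    """Run the harness once per sample and summarize each result."""
    results = {}
    for path in sample_paths:
        proc = subprocess.run(
            harness_cmd + [path],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        results[path] = describe_exit(proc.returncode)
    return results
```

Here `harness_cmd` would be the harness invocation without the input file, for example the Python interpreter, the harness script, and the context directory. Valid sample inputs should all report a clean exit; if they report crashes, the problem is in the emulation setup rather than in the target.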
If things look good (new paths are being discovered fairly regularly), then everything else follows typical AFL use patterns. Make sure your sample inputs get good coverage of the target code and fuzz to your heart’s content.
Where We Are and Where We’re Going
In this post I’ve demonstrated an example of how we’ve been using afl-unicorn to fuzz hard-to-reach interfaces of real-world applications. We’ve found this methodology to be very effective on Windows, Linux, Android, and iOS applications, and I assume that it would be easily portable to embedded systems.
The outstanding tasks are mainly to continue refining the scripts that make this methodology usable and expand them to additional operating systems and architectures. For example, emulating Windows applications introduces a long list of issues, as references to the PEB and TIB cause false crashes because they go through a segment register (FS on 32-bit Windows, GS on 64-bit Windows). Operating-system specific utilities could be created (in a similar manner as the UnicornSimpleHeap class that’s already in the unicorn_loader.py module) to handle these known cases with minimal instrumentation. This would be very similar to the route taken by the usercorn project. In addition, the ripr project is extremely interesting and I believe that there is a good possibility that their code-generation methods could be adapted or extended to generate a template test harness that would be very easy to make fuzzable.
In a future blog post I plan to demonstrate using afl-unicorn against a flat run-time memory image retrieved from an embedded system. That use case was the original inspiration for creating afl-unicorn, and I still believe it is the ideal environment as it avoids most of the problems introduced when trying to emulate a userland application running in a more sophisticated, multi-threaded OS.
I developed afl-unicorn and the methodology described here as an internal research project in collaboration with Parker Wiksell at Battelle in Columbus, OH. Battelle is an awesome place to work, and afl-unicorn is just one of many examples of novel cyber security research being done there. For more Battelle-sponsored projects, check out Chris Domas and John Toterhi (AKA cetfor)’s previous work. For information about careers at Battelle, check out their careers page.
Of course none of this would be possible without AFL and Unicorn Engine. Lots of additional inspiration came from Alex Hude’s awesome uEmu plugin for IDA, and many of the general concepts were borrowed from the NCC Group’s TriforceAFL project. A bunch of additional inspiration was pulled from the usercorn project, as it proved that Unicorn can be successfully made to run user-space applications.