Today we will begin our journey into the basics of hacking. Let’s not waste any time.
Here is the source code of a simple C program:
If we compile and execute the code it prints “Hello World” ten times:
It is worth emphasizing that C source code must be compiled into a binary file before it can be executed. By default, the binary file will be named “a.out.” However, we can specify a name for the binary file by writing “-o [name]” as we did above.
If we were to simply open the binary file “first” with the Linux utility “less” we would see the following:
This is only a snippet of a long machine language file
This file contains machine language, an elementary language that can be understood directly by the CPU. The role of your compiler is to translate the C source code into machine language that your processor architecture (the target architecture) can understand. In this article we are using an x86 processor architecture (the 64-bit variant, as the rax and rip register names later on indicate).
Let’s take a closer look at our compiled binary file “first” using the Linux utility “objdump”:
objdump -D -M intel first | grep -A20 main:
Objdump converts machine language instructions into human-readable assembly instructions. Just as machine language instructions are unique to a given processor architecture, assembly language instructions are also unique to the given processor architecture. The “-D” flag tells objdump to disassemble all sections. The “-M intel” flag tells objdump to use Intel syntax for the assembly language. Intel syntax (as opposed to AT&T syntax) does not contain the cacophony of % and $ signs, and I find it easier to read. We can make Intel syntax the default in gdb, the debugger we use below, by adding “set disassembly-flavor intel” to the .gdbinit file in our home directory.
The long number next to “<main>” is a memory address denoting where this function lives in the computer’s memory. The left column of numbers, each beginning with “4,” holds the memory addresses of the individual machine language instructions. The numbers in the middle column (55 is on top) are the machine language instructions themselves, shown as hexadecimal bytes; this is the form the CPU actually executes. The rightmost column contains the assembly language instructions. Assembly is just a human-readable way for programmers to represent the machine language instructions. Typically, Intel syntax follows a specific format:
<Operation> <Destination>, <Source>
Example:
mov rbp,rsp
The “destination” and “source” operands will each be a register, a memory address, or a value.
Registers are like internal variables for the processor; they allow the processor to read and write data efficiently, do simple math, etc.
We can examine the values of the registers before the program runs using gdb (a portable debugger written in C):
The “-q” flag just prevents gdb from printing the copyright information.
The first four registers, rax (accumulator), rbx (base), rcx (counter), and rdx (data), are general-purpose registers. They are mainly used as temporary variables by the CPU when executing machine language instructions.
The next four registers, rsi, rdi, rbp, and rsp, are also general-purpose registers but are sometimes known as pointers and indexes. They stand for Source Index, Destination Index, Base Pointer, and Stack Pointer. RSP and RBP are called pointers because they store addresses. RSI and RDI are technically pointers as well, commonly used to point to the source and destination when data needs to be read from or written to.
RIP is the Instruction Pointer register, which points to the instruction the processor is currently reading. RIP is very important. The remaining eflags register consists of several bit flags that are used for comparisons and memory segmentation.
Let’s use the debugger to step through our hello world program:
Notice that we have access to the source code from inside the debugger. In order to do this, you must compile the source file with the “-g” flag. Being able to view the source code from inside the debugger will help us keep track of things. We then disassemble the main function using “disass main” (we will use abbreviations when we can). The program is paused by a breakpoint on main() (set with “break main” before issuing “run”), which stops it before any of the statements in main() are actually run. We then use “i r rip” to examine the value of the Instruction Pointer. In this case, “i” is short for “info” and “r” is short for “registers.”
Notice that rip contains a memory address that points to an instruction in the main() function’s disassembly (the second mov instruction). All of the instructions before this are collectively known as the “function prologue” and are “generated by the compiler to set up memory for the rest of the main() function’s local variables.”
We can examine memory directly in gdb by typing “x” (for “examine”) and specifying two arguments: the location of memory to examine, and how to display that memory. Four common display formats are: o (octal), x (hexadecimal), u (unsigned base-10), t (binary).
The default size of a single unit is a four-byte word. The size of the display units for the examine command can be changed by adding a size letter to the end of the format letter. Size letters are as follows: b (a single byte), h (a halfword, two bytes), w (a word, four bytes), g (a giant, eight bytes):
You may notice that the bytes are being reversed. This is because values are stored in little-endian order on the x86 processor which means the least significant byte is stored first. In other words, “if four bytes are to be interpreted as a single value, the bytes must be used in reverse order” (Hacking: The Art of Exploitation, Jon Erickson).
The examine command can also display memory as disassembled assembly language instructions:
In this context, “i” stands for “instruction.” This instruction moves the value “0” into the location in memory that is four bytes below the address stored in the register rbp:
If we disassemble the main function we see that we are on this instruction.
If we run the command “nexti,” the current instruction will be executed and we will move on to the next instruction:
We are now at the jump instruction.
If we examine the value at the address rbp - 4 we will find that it has been zeroed out:
We zero out four bytes because an int in C occupies four bytes on this platform.
The next instructions make more sense to discuss as a group:
The current instruction is an unconditional jump to 0x40054c. This takes us to a “cmp” instruction, which implements the loop’s control flow by comparing i with 9. If i is less than or equal to 9 (that is, less than 10), we jump to location 0x40053e, which is another “mov” instruction. Let’s use the nexti command to execute the mov instruction and then inspect the edi register:
You may notice that each of these hexadecimal bytes appears in the ASCII table. We can look up a byte in the ASCII table by using the “c” display format:
So, the mov instruction loads the address of the string “Hello, World” into edi so that the subsequent call instruction can print the string.
This concludes our first dive into the basics of hacking and examining memory.