So far we have seen how to start up a VM, emulate the CPUID instruction, run arbitrary code, and learned about segmentation and paging in a CPU. This already allows us to do a lot of useful things with our VM. However, our CPU is still in so-called real mode. In this article we will go through setting up the paging structures and enabling protected mode with paging.
When protected mode is switched on, the CPU still uses segmentation. Segmentation in protected mode is not the same as segmentation in real mode (check out my previous article to learn more about this -> https://hackernoon.com/segmentation-and-paging-in-x86-xts3yp1). Segmentation is not used much today, and long mode relies almost exclusively on paging.
In this article we will do the following:
Even though program B resides at physical address 0x6000, we expect it to execute correctly when the first program jumps to 0xc000, thanks to the paging that we will set up.
Before the CPU can start using paging we need to set up what are called page tables. In this exercise we will set up a 2-level page table. In my opinion this is the simplest one to set up when starting out. After going through this process (and reading a bit of the Intel manual) I hope that you will be able to set up 3, 4 or even 5-level page tables.
The first-level page table is known as the page directory (PD), and each entry in that table is known as a page directory entry (PDE). The PD is 4 KB aligned.
The address of a PDE is built as follows: bits [31:12] are taken from CR3[31:12] (the base address of the PD), bits [11:2] are filled with bits LA[31:22] of the linear address (LA), and the last two bits are set to zero.
A PDE contains a pointer to a page table (PT). Elements of a page table are called page table entries (PTE). The 4 KB aligned page table is located at the memory location specified by bits PDE[31:12]. To form the address of a PTE, bits [11:2] are filled with bits LA[21:12] and the last two bits are set to zero (same as in the previous step).
Each PTE points to a page frame. The physical address of that page frame is formed by using bits PTE[31:12] and filling bits [11:0] with zeros.
The following figure depicts the process.
The lower 12 bits in CR3, PDE and PTE are status bits which we can use to configure the paging mechanism.
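The address arithmetic described above can be sketched in C. This is only an illustration under the 32-bit, two-level scheme used in this article; the function names pde_address, pte_address and frame_base are made up for this example and are not part of any API.

```c
#include <stdint.h>

/* Hypothetical helpers (illustrative names) for 32-bit, two-level paging. */

/* Physical address of the PDE: CR3[31:12] : LA[31:22] : 00 */
static uint32_t pde_address(uint32_t cr3, uint32_t la) {
    return (cr3 & 0xFFFFF000u) | (((la >> 22) & 0x3FFu) << 2);
}

/* Physical address of the PTE: PDE[31:12] : LA[21:12] : 00 */
static uint32_t pte_address(uint32_t pde, uint32_t la) {
    return (pde & 0xFFFFF000u) | (((la >> 12) & 0x3FFu) << 2);
}

/* Base address of the page frame: PTE[31:12] followed by twelve zero bits */
static uint32_t frame_base(uint32_t pte) {
    return pte & 0xFFFFF000u;
}
```

For example, with CR3 = 0x1000 and a PDE value of 0x2003, the linear address 0xc000 selects the PDE at 0x1000 and the PTE at 0x2030, because LA[21:12] = 12 and each entry is 4 bytes wide.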
Following is the image that depicts page mapping that we will implement.
On the left-hand side are logical/virtual addresses, while on the right-hand side you can see the page frames that represent physical memory.
As you read the following sections feel free to refer to the map above to enrich your understanding.
Firstly we will identity map a small region of memory [0x0000 - 0x5fff]. This region will be used to store our PD, PT, stack and some other CPU relevant stuff.
Page tables will be stored in memory region ranging from [0x1000, 0x2fff]. This will ensure that we can access our page tables once we enable paging mode.
Now we know that the PD and PT each use 10 bits of the linear address (LA) to select an entry. Using that many bits we can select 2^10 entries. Given that each entry is 4 bytes in size, our PD and each of the PTs that we create will be 4 KB in size. If we decide to place those sequentially in memory, the tables will be 0x1000 (4 KB) bytes apart.
We also know that the last 12 bits of the LA are used to select a specific byte (physical address / PA) in a page frame. That tells us that a single page frame holds 2^12 = 4096 bytes = 4 KB = 0x1000 bytes.
Each PTE covers a 4096-byte region of memory. That means we need three PTEs to cover the region from 0x0000 to 0x2fff, and six PTEs to cover the whole identity-mapped region from 0x0000 to 0x5fff.
Going backwards, we know that bits LA[31:22] select one of 2^10 = 1024 entries in the PD, each of which is 4 bytes in size (because the 10 bits of the LA are combined to produce the address of the entry and then the last two bits are set to 0). A single PDE points to one PT, which covers 1024 x 4 KB = 4 MB of address space. Since every address we map falls within the first 4 MB, we need only one PDE.
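The counting above can be double-checked with a little arithmetic. The sketch below uses two hypothetical helpers, pages_needed and pdes_needed (names invented for this example), and assumes 4 KB pages with 1024 entries per page table.

```c
#include <stdint.h>

/* Hypothetical helper: number of 4 KB pages (and therefore PTEs) needed
 * to cover the inclusive address range [start, end]. */
static uint32_t pages_needed(uint32_t start, uint32_t end) {
    return (end >> 12) - (start >> 12) + 1;
}

/* Hypothetical helper: number of PDEs needed for the same range.
 * Each PDE covers 1024 pages of 4 KB, that is 4 MB (1 << 22). */
static uint32_t pdes_needed(uint32_t start, uint32_t end) {
    return (end >> 22) - (start >> 22) + 1;
}
```

Calling pages_needed(0x0000, 0x2fff) gives 3 and pages_needed(0x0000, 0x5fff) gives 6, while pdes_needed(0x0000, 0xffff) gives 1, which matches the single PDE used in this article.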
We will place our PD at memory location 0x1000 and PT at 0x2000.
Our mapping so far looks like this:
Check the following section to understand why the last digit is 3. Hint - those come from status bits.
The next area that we will map is [0xc000, 0xffff]. That is, four blocks of 4 KB in size: 0xc000, 0xd000, 0xe000 and 0xf000. The mapping will be as follows:
1. [0xc000, 0xcfff] -> [0x6000, 0x6fff]
2. [0xd000, 0xdfff] -> [0x7000, 0x7fff]
3. [0xe000, 0xefff] -> [0x8000, 0x8fff]
4. [0xf000, 0xffff] -> [0x9000, 0x9fff]
This mapping is a bit trickier, so I made the following diagram to demonstrate how the mapping from LA 0xc000 to PA 0x6000 works.
From the image we can see that our PTEs should be as follows:
As you can see in the figure, the last digit being 3 in the PDE and PTEs is due to setting status bits 0 and 1. Bit 0 tells the CPU that the page table (for a PDE) or the page frame (for a PTE) is present, and bit 1 tells it that the memory is writable as well as readable.
Now that we know what the page tables should look like, implementing this is easy:
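To make the magic number 3 a little more readable, the two status bits can be given names. This is a small sketch; the macro names PAGE_PRESENT and PAGE_RW and the helper make_entry are my own choices for illustration, not something defined by KVM or the kernel headers.

```c
#include <stdint.h>

/* Illustrative names for the two status bits used in this article. */
#define PAGE_PRESENT 0x1u /* bit 0: the page table or page frame is present */
#define PAGE_RW      0x2u /* bit 1: writes are allowed */

/* Hypothetical helper: combine a 4 KB aligned target address with flags. */
static uint32_t make_entry(uint32_t target, uint32_t flags) {
    return (target & 0xFFFFF000u) | (flags & 0xFFFu);
}
```

With these definitions, make_entry(0x2000, PAGE_PRESENT | PAGE_RW) yields 0x2003, which is the PDE value used in this article.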
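/* mem is the host buffer that the guest will see at physical address 0x1000. */
void* createPageTable(void *mem) {
    uint8_t *base = mem; /* byte pointer: arithmetic on void* is a compiler extension */

    /* PD at buffer offset 0x0: PDE 0 points to the PT at guest address 0x2000. */
    uint32_t pde = 0x2000 | 0x3;
    memcpy(base, &pde, 4);

    /* PT at buffer offset 0x1000: PTEs 0-5 identity map 0x0000-0x5fff. */
    uint32_t pte_1 = 0x0000 | 0x3;
    memcpy(base + 0x1000, &pte_1, 4);
    uint32_t pte_2 = 0x1000 | 0x3;
    memcpy(base + 0x1004, &pte_2, 4);
    uint32_t pte_3 = 0x2000 | 0x3;
    memcpy(base + 0x1008, &pte_3, 4);
    uint32_t pte_4 = 0x3000 | 0x3;
    memcpy(base + 0x100c, &pte_4, 4);
    uint32_t pte_5 = 0x4000 | 0x3;
    memcpy(base + 0x1010, &pte_5, 4);
    uint32_t pte_6 = 0x5000 | 0x3;
    memcpy(base + 0x1014, &pte_6, 4);

    /* PTEs 12-15 (offsets 0x30-0x3c) map 0xc000-0xffff to 0x6000-0x9fff. */
    uint32_t pte_7 = 0x6000 | 0x3;
    memcpy(base + 0x1030, &pte_7, 4);
    uint32_t pte_8 = 0x7000 | 0x3;
    memcpy(base + 0x1034, &pte_8, 4);
    uint32_t pte_9 = 0x8000 | 0x3;
    memcpy(base + 0x1038, &pte_9, 4);
    uint32_t pte_10 = 0x9000 | 0x3;
    memcpy(base + 0x103c, &pte_10, 4);
    return 0;
}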
Here we use type uint32_t as its size is 4 bytes and it is unsigned. The PDE is placed at the beginning of the memory block passed into the function. The PTEs are placed in the same block, offset by 0x1000 bytes. Since the block will later be attached to the VM at physical address 0x1000, this puts the PD at 0x1000 and the PT at 0x2000.
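One way to gain confidence in the tables before handing them to the CPU is to walk them in software. The sketch below is an assumption-laden illustration: build_tables is a loop-based stand-in for createPageTable, translate is a made-up name for a software page walk, and GUEST_BASE encodes the assumption that the buffer is attached at guest physical address 0x1000. The buffer must be zeroed and at least 0x2000 bytes long.

```c
#include <stdint.h>
#include <string.h>

#define GUEST_BASE 0x1000u /* assumed guest physical address of the buffer */
#define FLAGS      0x3u    /* present + writable */

/* Read a 32-bit entry stored at a guest physical address inside the buffer. */
static uint32_t read_entry(const uint8_t *buf, uint32_t guest_pa) {
    uint32_t v;
    memcpy(&v, buf + (guest_pa - GUEST_BASE), sizeof v);
    return v;
}

static void write_entry(uint8_t *buf, uint32_t guest_pa, uint32_t v) {
    memcpy(buf + (guest_pa - GUEST_BASE), &v, sizeof v);
}

/* Loop-based sketch of the same layout that createPageTable produces. */
static void build_tables(uint8_t *buf) {
    write_entry(buf, 0x1000u, 0x2000u | FLAGS);   /* PDE 0 -> PT at 0x2000 */
    for (uint32_t i = 0; i < 6; i++)              /* identity map 0x0000-0x5fff */
        write_entry(buf, 0x2000u + i * 4, (i << 12) | FLAGS);
    for (uint32_t i = 0; i < 4; i++)              /* 0xc000-0xffff -> 0x6000-0x9fff */
        write_entry(buf, 0x2000u + (12 + i) * 4, ((6 + i) << 12) | FLAGS);
}

/* Software page walk: returns 0 and stores the physical address in *pa,
 * or returns -1 when the PDE or PTE is not present. */
static int translate(const uint8_t *buf, uint32_t cr3, uint32_t la, uint32_t *pa) {
    uint32_t pde = read_entry(buf, (cr3 & 0xFFFFF000u) | (((la >> 22) & 0x3FFu) << 2));
    if (!(pde & 0x1u))
        return -1;
    uint32_t pte = read_entry(buf, (pde & 0xFFFFF000u) | (((la >> 12) & 0x3FFu) << 2));
    if (!(pte & 0x1u))
        return -1;
    *pa = (pte & 0xFFFFF000u) | (la & 0xFFFu);
    return 0;
}
```

After build_tables has filled a zeroed buffer, translate(buf, 0x1000, 0xc000, &pa) should report physical address 0x6000, while an unmapped address such as 0xa000 should be rejected.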
In previous articles we have seen that the host OS needs to allocate some memory for a VM:
void *mem = mmap(NULL, 0x8000, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
We will use our function createPageTable to populate that block of memory with the correct data structures:
createPageTable(mem);
We will write two programs that we can use to test our mapping. The first program (a.asm) will output the character 'A' and be loaded at memory address 0x4000. This area of memory is identity mapped, so physical and virtual addresses are the same.
The other program (b.asm) will output the character 'B'. It is loaded at physical address 0x6000, which is mapped to virtual address 0xc000. We will use a jump instruction in a.asm to jump to 0xc000. If we see the output of the second program, we can say that we set up paging correctly.
Here is our a.asm program:
[BITS 32]
    mov ax, 'A'      ; character to print
    mov dx, 0x3f8    ; serial port
    out dx, al       ; write the character to the port
    mov eax, 0xc000  ; virtual address of program B
    jmp eax
And here is our b.asm program:
[BITS 32]
    mov ax, 'B'      ; character to print
    mov dx, 0x3f8    ; serial port
    out dx, al       ; write the character to the port
    hlt              ; stop the CPU
We will use nasm to assemble both programs:
nasm -o a.bin a.asm
nasm -o b.bin b.asm
If you are interested in looking at the binary code that was generated you can use tools like objdump. I hope to write a blog post at some point where I'll disassemble one of these (or a similar) programs and go through the generated binary file.
Now that we have the files generated, we need to load them into memory. Here is one way to do this.
First, in order to know how much memory to allocate, we need to find out the size of our binaries. We can use the host OS's facilities to do so.
off_t get_file_size(int fd) {
    struct stat s;
    if (fstat(fd, &s) < 0) {
        return -1;
    }
    return s.st_size;
}
We will also need a function to copy those files into memory used by the VM:
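/* Copies the whole file into buf. Returns the number of bytes copied, or -1 on a read error. */
int read_into_buffer(int fd, uint8_t *buf) {
    uint8_t temp[32];
    int total_bytes_copied = 0;
    ssize_t r;
    /* read returns 0 at end of file and -1 on error, so stop on both */
    while ((r = read(fd, temp, sizeof temp)) > 0) {
        memcpy(buf + total_bytes_copied, temp, r);
        total_bytes_copied += r;
    }
    return r < 0 ? -1 : total_bytes_copied;
}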
Next, let's open the binary files:
int a_fd = open("a.bin", O_RDONLY);
if (a_fd == -1) {
    printf("Could not open a.bin.\n");
    return -1;
}
int b_fd = open("b.bin", O_RDONLY);
if (b_fd == -1) {
    printf("Could not open b.bin.\n");
    return -1;
}
And copy them into VM memory:
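/* a_fs and b_fs hold the sizes of a.bin and b.bin as reported by get_file_size */
off_t a_fs = get_file_size(a_fd);
off_t b_fs = get_file_size(b_fd);
/* a.bin goes to buffer offset 0x3000 */
int bytes_copied = read_into_buffer(a_fd, (uint8_t *)mem + 0x3000);
if (a_fs < 0 || bytes_copied != a_fs) {
    printf("Expected to copy as many bytes as there are in a.bin.\n");
    return -1;
}
/* b.bin goes to buffer offset 0x5000 */
bytes_copied = read_into_buffer(b_fd, (uint8_t *)mem + 0x5000);
if (b_fs < 0 || bytes_copied != b_fs) {
    printf("Expected to copy as many bytes as there are in b.bin.\n");
    return -1;
}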
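The open, size-check and copy steps above could also be folded into a single helper. The following is a minimal sketch rather than the code used in the repository; load_binary is a hypothetical name, and the capacity parameter is an extra safety check that I added so the copy can never run past the destination.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copy the file at `path` into `dest`.
 * Returns the number of bytes copied, or -1 if the file cannot be opened,
 * does not fit into `capacity` bytes, or cannot be read completely. */
static ssize_t load_binary(const char *path, uint8_t *dest, size_t capacity) {
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    struct stat s;
    if (fstat(fd, &s) < 0 || (size_t)s.st_size > capacity) {
        close(fd);
        return -1;
    }

    size_t total = 0;
    while (total < (size_t)s.st_size) {
        ssize_t r = read(fd, dest + total, (size_t)s.st_size - total);
        if (r <= 0) { /* error or unexpected end of file */
            close(fd);
            return -1;
        }
        total += (size_t)r;
    }
    close(fd);
    return (ssize_t)total;
}
```

With such a helper, loading a.bin would come down to a call like load_binary("a.bin", (uint8_t *)mem + 0x3000, 0x1000), where the capacity of one page is an assumption about how large the test programs are allowed to be.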
Notice how we have offset the files a.bin and b.bin by 0x3000 and 0x5000 bytes respectively. Since this block of memory will be attached to the VM starting at address 0x1000, the programs a.bin and b.bin will be found at addresses 0x4000 and 0x6000 from the perspective of the VM.
struct kvm_userspace_memory_region region = {
    .slot = 0,
    .guest_phys_addr = 0x1000,
    .memory_size = 0x8000,
    .userspace_addr = (uint64_t)mem,
};
ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &region);
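The relationship between guest physical addresses and offsets into the host buffer can be written down explicitly. This is a small sketch with a hypothetical helper named guest_to_host; it only mirrors the arithmetic implied by guest_phys_addr and memory_size in the region above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: translate a guest physical address into a host pointer
 * for a region of `size` bytes attached at `guest_base`. Returns NULL when
 * the address is not backed by the region. */
static void *guest_to_host(void *mem, uint64_t guest_base, uint64_t size,
                           uint64_t guest_pa) {
    if (guest_pa < guest_base || guest_pa - guest_base >= size)
        return NULL;
    return (uint8_t *)mem + (guest_pa - guest_base);
}
```

For the region used here, guest address 0x4000 resolves to mem + 0x3000 and 0x6000 resolves to mem + 0x5000, which is exactly where we copied a.bin and b.bin. Guest addresses below 0x1000 or at 0x9000 and above return NULL, since only 0x1000 to 0x8fff is backed by host memory.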
There are two more pieces of the puzzle that we need to address:
The CPU looks in register CR3 for the address of the page directory. This is a special register, and its value can be retrieved and set using the KVM_GET_SREGS and KVM_SET_SREGS pair of ioctl commands.
Protected mode is enabled by setting bit 0 in the CR0 register, while paging is enabled by setting bit 31 in the same register.
struct kvm_sregs sregs;
ioctl(vcpufd, KVM_GET_SREGS, &sregs);
sregs.cr3 = 0x1000;
sregs.cr0 = sregs.cr0 | 0x80000001;
ioctl(vcpufd, KVM_SET_SREGS, &sregs);
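The constant 0x80000001 is simply the two CR0 bits mentioned above combined. As a small sketch, they can be named; CR0_PE and CR0_PG are illustrative macro names I chose here, and enable_protected_paging is a made-up helper.

```c
#include <stdint.h>

/* Illustrative names for the CR0 bits used in this article. */
#define CR0_PE (1u << 0)  /* bit 0: protection enable */
#define CR0_PG (1u << 31) /* bit 31: paging */

/* Hypothetical helper: return cr0 with protected mode and paging switched on,
 * leaving every other bit untouched. */
static uint64_t enable_protected_paging(uint64_t cr0) {
    return cr0 | CR0_PE | CR0_PG;
}
```

With this in place, the assignment could read sregs.cr0 = enable_protected_paging(sregs.cr0); which does the same thing as OR-ing with 0x80000001.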
Finally, set the instruction pointer (IP/rip) to 0x4000 (the location of the a.bin program):
struct kvm_regs regs = {
    .rip = 0x4000,
};
ret = ioctl(vcpufd, KVM_SET_REGS, &regs);
And we are ready to kick off our VM:
...
ioctl(vcpufd, KVM_RUN, NULL);
...
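The part elided above is the usual loop around KVM_RUN that inspects why the guest exited. Below is a hedged sketch of how the exits produced by our two programs might be handled. The function name handle_exit and the SERIAL_PORT macro are my own, and the code assumes that run points to the struct kvm_run region obtained by mmap-ing vcpufd with the size reported by KVM_GET_VCPU_MMAP_SIZE.

```c
#include <linux/kvm.h>
#include <stdint.h>
#include <stdio.h>

#define SERIAL_PORT 0x3f8 /* the port our test programs write to */

/* Hypothetical helper: react to the exit reason left in `run` after KVM_RUN.
 * Returns 1 to keep running, 0 when the guest executed hlt,
 * and -1 for any exit this sketch does not expect. */
static int handle_exit(struct kvm_run *run) {
    switch (run->exit_reason) {
    case KVM_EXIT_IO:
        if (run->io.direction == KVM_EXIT_IO_OUT && run->io.port == SERIAL_PORT) {
            /* The bytes written by the guest live inside the kvm_run mapping. */
            const uint8_t *data = (const uint8_t *)run + run->io.data_offset;
            fwrite(data, run->io.size, run->io.count, stdout);
            fflush(stdout);
            return 1;
        }
        return -1;
    case KVM_EXIT_HLT:
        return 0;
    default:
        return -1;
    }
}
```

The caller would invoke ioctl(vcpufd, KVM_RUN, NULL) and then handle_exit(run) repeatedly for as long as handle_exit returns 1. Treating every other exit reason as an error is a simplification that keeps the sketch short.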
And hopefully you'll be greeted with the following output:
Complete source code is available at GitLab: https://gitlab.com/mvuksano/kvm-playground/-/tree/master/04-protected-mode-with-paging. Check out the README on how to run the code if you get stuck.
And that's it! You have reached the end of this article. I hope that by now you have a better understanding of what it takes to configure a CPU for what most engineers take for granted.
In the upcoming articles I will talk more about compiling assembly programs for different architectures (16/32/64) and configuring CPU to work in 64 bit mode. Stay tuned.