Enabling Paging for a Virtual Machine

by Marko Vuksanovic, July 11th, 2020

Too Long; Didn't Read

In this article we will go through setting up a paging structure and enabling protected mode with paging. When protected mode is switched on, the CPU still uses segmentation, but this is not the same as segmentation in real mode. Segmentation is not used much today, and long mode relies almost exclusively on paging. After going through this process (and reading a bit of the Intel manual) I hope that you will be able to set up 3, 4 or even 5 level page tables.

So far we have seen how to start up a VM, emulate the CPUID instruction and run arbitrary code, and we have learned about segmentation and paging in a CPU. This already allows us to do a lot of useful things with our VM. However, our CPU is still in so-called real mode. In this article we will go through setting up a paging structure and enabling protected mode with paging.

When protected mode is switched on, the CPU still uses segmentation. Segmentation in protected mode is not the same as segmentation in real mode (check out my previous article to learn more about this: https://hackernoon.com/segmentation-and-paging-in-x86-xts3yp1). Segmentation is not used much today, and long mode relies almost exclusively on paging.

In this article we will do the following:

Configure CPU for paging
Load bootstrap program (program A) at memory location 0x4000
Load another program (program B) at memory location 0x6000 which will be mapped to virtual address 0xc000
Switch CPU to protected mode with paging enabled starting execution in program A and doing a jump to 0xc000

Even though program B resides at physical address 0x6000 we expect it to be correctly executed when we execute a jump to 0xc000 from the first program due to paging that will be set up.

1. Configure CPU for paging

Before CPU can start using paging we need to set up what is called page tables. In this exercise we will set up a 2 level page table. In my opinion this is the simplest to set up when starting. After going through this process (and reading a bit of Intel manual) I hope that you will be able to set up 3, 4 or even 5 level page tables.

The first level page table is known as the page directory (PD) and each entry in that table is known as a page directory entry (PDE). The PD is 4 KB aligned.

The address of the PD is taken from bits CR3[31:12]. To find the PDE for a given linear address (LA), the CPU builds an address whose bits [31:12] are CR3[31:12], whose bits [11:2] are LA[31:22], and whose last two bits are zero.

A PDE contains a pointer to a page table (PT). Elements of a page table are called page table entries (PTEs). The 4 KB aligned page table is located at the memory location specified by bits PDE[31:12]. The address of a PTE is formed the same way as in the previous step: bits [11:2] are filled with LA[21:12] and the last two bits are set to zero.

Each PTE points to a Page Frame. The physical address of that page frame is formed by using bits PTE[31:12] and filling bits [11:0] with zeros.

Following figure depicts the process.

Lower 12 bits in CR3, PDE and PTE are status bits which we can use to configure paging mechanism.
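The address arithmetic described above can be sketched in a few lines of C. This is only an illustration of the bit manipulation, not code from the article's repository; the function names pde_address, pte_address and physical_address are made up for this example.

```c
#include <stdint.h>

/* Address of the PDE selected by linear address la.
 * Bits [31:12] come from CR3, bits [11:2] from LA[31:22], bits [1:0] are zero. */
static inline uint32_t pde_address(uint32_t cr3, uint32_t la) {
    return (cr3 & 0xfffff000u) | ((la >> 22) << 2);
}

/* Address of the PTE selected by la, given the PDE value read from memory.
 * Bits [31:12] come from the PDE, bits [11:2] from LA[21:12], bits [1:0] are zero. */
static inline uint32_t pte_address(uint32_t pde, uint32_t la) {
    return (pde & 0xfffff000u) | (((la >> 12) & 0x3ffu) << 2);
}

/* Physical address of the byte: page frame base from PTE[31:12]
 * combined with the 12-bit offset LA[11:0]. */
static inline uint32_t physical_address(uint32_t pte, uint32_t la) {
    return (pte & 0xfffff000u) | (la & 0xfffu);
}
```

Masking with 0xfffff000 strips the status bits before the value is used as an address, which is why the same 32-bit entry can carry both a pointer and its flags.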

1.1. Game plan

Following is the image that depicts page mapping that we will implement.
On the left hand side are logical/virtual addresses, while on the right hand side you can see page frames that represent physical memory.

As you read the following sections feel free to refer to the map above to enrich your understanding.

1.1 Identity map a region of memory

Firstly we will identity map a small region of memory [0x0000 - 0x5fff]. This region will be used to store our PD, PT, stack and some other CPU relevant stuff.

Page tables will be stored in memory region ranging from [0x1000, 0x2fff]. This will ensure that we can access our page tables once we enable paging mode.

Now we know that the PD and PT each use 10 bits of the linear address (LA) to select an entry. Using that many bits we can select 2^10 = 1024 entries. Given each entry is 4 bytes in size, our PD and each of the PTs that we create will be 4 KB in size. If we decide to place those sequentially in memory, each of the tables will be 0x1000 (4 KB) bytes apart.

We also know that the last 12 bits of the LA are used to select a specific byte in a page frame. That tells us that a single page frame holds 2^12 = 4096 bytes = 4 KB = 0x1000 bytes.

Each PTE covers a 4096 byte region of memory. That means we will need six PTEs to cover the region of memory from 0x0000 to 0x5fff.

Going backwards, each PT holds 2^10 = 1024 PTEs, each of which is 4 bytes in size (because 10 bits of the LA are combined to produce the address of a PTE and then the last two bits are set to 0). A single PT therefore covers 1024 * 4 KB = 4 MB of linear addresses. Given that all of the PTEs we need fit in the first PT, we need only one PDE.
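The counting in this section can be double checked with a small helper. This is a minimal sketch; the names PAGE_BYTES, TABLE_ENTRIES, ptes_needed and pdes_needed are hypothetical and exist only for this example.

```c
#include <stdint.h>

#define PAGE_BYTES    0x1000u   /* 2^12 bytes per page frame */
#define TABLE_ENTRIES 1024u     /* 2^10 entries per PD or PT */

/* Number of PTEs needed to cover the inclusive address range [first, last]. */
static inline uint32_t ptes_needed(uint32_t first, uint32_t last) {
    return (last / PAGE_BYTES) - (first / PAGE_BYTES) + 1;
}

/* Number of PDEs (and therefore PTs) needed for the same range.
 * One PT spans TABLE_ENTRIES pages, which is 4 MB. */
static inline uint32_t pdes_needed(uint32_t first, uint32_t last) {
    uint32_t span = PAGE_BYTES * TABLE_ENTRIES;
    return (last / span) - (first / span) + 1;
}
```

For the identity mapped region, ptes_needed(0x0000, 0x5fff) evaluates to 6, and pdes_needed(0x0000, 0xffff) evaluates to 1, which is why a single PDE is enough for everything mapped in this article.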

We will place our PD at memory location 0x1000 and PT at 0x2000.

Our mapping so far looks like this:

CR3 = 0x1000
[0x1000] = 0x2003 (at location 0x1000, which is our PD we have value 0x2000 which is location of our PT). This is our only PDE.
[0x2000] = 0x0003
[0x2004] = 0x1003
[0x2008] = 0x2003
[0x200c] = 0x3003
[0x2010] = 0x4003
[0x2014] = 0x5003

Check the following section to understand why the last digit is 3. Hint - those come from status bits.

1.2 Map "program region"

The next area that we will map is [0xc000, 0xffff]. That is, four blocks of 4 KB each: 0xc000, 0xd000, 0xe000 and 0xf000. The mapping will be as follows:

1. [0xc000, 0xcfff] -> [0x6000, 0x6fff]
2. [0xd000, 0xdfff] -> [0x7000, 0x7fff]
3. [0xe000, 0xefff] -> [0x8000, 0x8fff]
4. [0xf000, 0xffff] -> [0x9000, 0x9fff]

This mapping is a bit trickier, so I made the following diagram to demonstrate how the mapping from LA 0xc000 to PA 0x6000 works.

From the image we can see that our PTEs should be as follows:

[0x2030] = 0x6003
[0x2034] = 0x7003
[0x2038] = 0x8003
[0x203c] = 0x9003

As you can see in the figure, the last digit being 3 in the PDE and PTEs is due to setting status bits 0 and 1. Setting bit 0 tells the CPU that the PT (for a PDE) or the page frame (for a PTE) is present. Setting bit 1 makes it writable in addition to readable.
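To make the status bits less magical, here is a small sketch that names the two bits used in this article and builds an entry from an address and flags. The identifiers ENTRY_PRESENT, ENTRY_WRITABLE and make_entry are assumptions made for this example; they are not taken from the article's source code.

```c
#include <assert.h>
#include <stdint.h>

#define ENTRY_PRESENT  (1u << 0)   /* bit 0: the referenced PT or page frame is present */
#define ENTRY_WRITABLE (1u << 1)   /* bit 1: writes are allowed */

/* Combine a 4 KB aligned address with status bits to form a PDE or PTE. */
static inline uint32_t make_entry(uint32_t addr, uint32_t flags) {
    assert((addr & 0xfffu) == 0);  /* the low 12 bits are reserved for status bits */
    return addr | flags;
}

static inline int entry_is_present(uint32_t entry) {
    return (entry & ENTRY_PRESENT) != 0;
}

static inline int entry_is_writable(uint32_t entry) {
    return (entry & ENTRY_WRITABLE) != 0;
}
```

With these helpers, make_entry(0x6000, ENTRY_PRESENT | ENTRY_WRITABLE) produces 0x6003, the value stored at [0x2030].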

2. Implementation

Now that we know what the page tables should look like implementing this is easy:

#include <stdint.h>
#include <string.h>

/* mem is the host buffer that will be attached to the VM at guest physical
 * address 0x1000, so offset 0x0000 holds the PD (guest 0x1000) and
 * offset 0x1000 holds the PT (guest 0x2000). */
void* createPageTable(void *mem) {
        uint8_t *base = mem;  /* byte pointer, since arithmetic on void* is a compiler extension */

        /* The only PDE: points to the PT at 0x2000, present and writable. */
        uint32_t pde = 0x2000 | 0x3;
        memcpy(base, &pde, 4);

        /* PTEs 0 to 5: identity map [0x0000, 0x5fff]. */
        for (uint32_t i = 0; i < 6; i++) {
                uint32_t pte = (i * 0x1000) | 0x3;
                memcpy(base + 0x1000 + i * 4, &pte, 4);
        }

        /* PTEs 12 to 15: map [0xc000, 0xffff] to [0x6000, 0x9fff]. */
        for (uint32_t i = 0; i < 4; i++) {
                uint32_t pte = (0x6000 + i * 0x1000) | 0x3;
                memcpy(base + 0x1000 + (12 + i) * 4, &pte, 4);
        }

        return 0;
}

Here we use type uint32_t as its size is 4 bytes and it is unsigned. The PDE will be placed at the beginning of the memory block passed into the function. PTEs will be placed into the same block but offset by 0x1000 bytes.

In previous articles we have seen that host OS needs to allocate some memory for a VM.

void *mem = mmap(NULL, 0x8000, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);

We will use our function createPageTable to populate that block of memory with correct data structures:

createPageTable(mem);
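Before handing the block over to the VM, it can be useful to check the tables by walking them in software, the same way the CPU would. The sketch below is one possible way to do that and is not part of the article's source code. It assumes the host buffer is attached at guest physical address 0x1000 (GUEST_BASE is a name made up here) and that the host is little-endian like the guest, which holds for x86. The names read_entry and walk are also hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define GUEST_BASE 0x1000u   /* assumed guest physical address of the host buffer */

/* Read the 32-bit entry stored at guest physical address gpa.
 * Returns 0 (a non-present entry) if gpa falls outside the buffer. */
static uint32_t read_entry(const uint8_t *mem, size_t mem_size, uint32_t gpa) {
    uint32_t entry = 0;
    if (gpa >= GUEST_BASE && (size_t)(gpa - GUEST_BASE) + 4 <= mem_size) {
        memcpy(&entry, mem + (gpa - GUEST_BASE), 4);
    }
    return entry;
}

/* Translate linear address la using a 2 level walk.
 * Returns 0 and stores the result in *pa, or -1 if the PDE or PTE is not present. */
static int walk(const uint8_t *mem, size_t mem_size,
                uint32_t cr3, uint32_t la, uint32_t *pa) {
    uint32_t pde = read_entry(mem, mem_size,
                              (cr3 & 0xfffff000u) | ((la >> 22) << 2));
    if (!(pde & 1u)) {
        return -1;
    }
    uint32_t pte = read_entry(mem, mem_size,
                              (pde & 0xfffff000u) | (((la >> 12) & 0x3ffu) << 2));
    if (!(pte & 1u)) {
        return -1;
    }
    *pa = (pte & 0xfffff000u) | (la & 0xfffu);
    return 0;
}
```

With the tables from this article in place, a call such as walk(mem, 0x8000, 0x1000, 0xc000, &pa) should succeed and leave 0x6000 in pa, while an unmapped address such as 0xb000 should make it return -1.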

3. Executable programs

We will write two programs that we can use to test our mapping. The first program (a.asm) will output character 'A' and be loaded at memory address 0x4000. This area of memory is identity mapped so physical and virtual addresses are the same.

The other program (b.asm) is loaded at virtual memory address 0xc000, which is mapped to physical address 0x6000. We will use a jump instruction in a.asm to jump to 0xc000. If we see the output of the second program we can say that we set up paging correctly.

Here is our a.asm program:

[BITS 32]
	mov al, 'A'      ; character to print
	mov dx, 0x3f8    ; serial port
	out dx, al       ; write the character to the port
	mov eax, 0xc000  ; virtual address of program B
	jmp eax

And here is our b.asm program:

[BITS 32]
	mov al, 'B'      ; character to print
	mov dx, 0x3f8    ; serial port
	out dx, al       ; write the character to the port
	hlt              ; stop the vCPU

We will use nasm to assemble both of the programs:

nasm -o a.bin a.asm
nasm -o b.bin b.asm

If you are interested in looking at the binary code that was generated you can use tools like objdump. I hope I'll write a blog post at some point where I'll disassemble one of these (or similar) programs and go through the generated binary file.

4. Loading binary files into memory

Now that we have files generated, we will need to load them into memory. Here is one way to do this.

First in order for us to know how much memory to allocate we need to find out size of our binaries. We can use host OS's facilities to do so.

off_t get_file_size(int fd) {
  struct stat s;
  if(fstat(fd, &s) < 0) {
	  return -1;
  }
  return s.st_size;
}

We will also need a function to copy those files into memory used by the VM:

int read_into_buffer(int fd, uint8_t *buf) {
	uint8_t temp[32];
	int total_bytes_copied = 0;
	ssize_t r;
	/* read() returns 0 at end of file and -1 on error */
	while((r = read(fd, temp, sizeof(temp))) > 0) {
		memcpy(buf + total_bytes_copied, temp, r);
		total_bytes_copied += r;
	}
	if(r < 0) {
		return -1;
	}
	return total_bytes_copied;
}

Next, let's open binary files:

int a_fd = open("a.bin", O_RDONLY);
if (a_fd == -1) {
        printf("Could not open a.bin.\n");
        return -1;
}

int b_fd = open("b.bin", O_RDONLY);
if (b_fd == -1) {
        printf("Could not open b.bin.\n");
        return -1;
}

And copy them into VM memory:

off_t a_fs = get_file_size(a_fd);
int bytes_copied = read_into_buffer(a_fd, (uint8_t *)mem + 0x3000);
if(a_fs < 0 || bytes_copied != a_fs) {
        printf("Expected to copy as many bytes as there are in a.bin.\n");
        return -1;
}

off_t b_fs = get_file_size(b_fd);
bytes_copied = read_into_buffer(b_fd, (uint8_t *)mem + 0x5000);
if(b_fs < 0 || bytes_copied != b_fs) {
        printf("Expected to copy as many bytes as there are in b.bin.\n");
        return -1;
}

Notice how we have offset files a.bin and b.bin by 0x3000 and 0x5000 bytes respectively. Since this block of memory will be attached to the VM starting at address 0x1000, programs a.bin and b.bin will be found at addresses 0x4000 and 0x6000 from the perspective of VM.

struct kvm_userspace_memory_region region = {
    .slot = 0,
    .guest_phys_addr = 0x1000,
    .memory_size = 0x8000,
    .userspace_addr = (uint64_t)mem,
};

ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &region);
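The offsets 0x3000 and 0x5000 follow from subtracting the region's guest base address from the guest physical address. A small helper makes that relationship explicit. This is a sketch only; guest_to_host is a hypothetical name and is not part of the KVM API.

```c
#include <stddef.h>
#include <stdint.h>

/* Translate a guest physical address into a pointer inside the host buffer,
 * or return NULL if the address is not backed by the memory region. */
static inline uint8_t *guest_to_host(uint8_t *mem, uint64_t region_base,
                                     uint64_t region_size, uint64_t gpa) {
    if (gpa < region_base || gpa - region_base >= region_size) {
        return NULL;
    }
    return mem + (gpa - region_base);
}
```

With region_base = 0x1000 and region_size = 0x8000, guest address 0x4000 resolves to mem + 0x3000 and guest address 0x6000 resolves to mem + 0x5000, matching the offsets used when copying a.bin and b.bin.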

5. Running CPU in protected mode with paging enabled

There are two more pieces of puzzle that we need to address:

tell vCPU where to find page tables
tell vCPU to start in protected mode with paging enabled.

The CPU looks in register CR3 for the address of the page tables. This is a special register and its value can be retrieved and set using the KVM_GET_SREGS and KVM_SET_SREGS pair of ioctl commands.

Protected mode for a CPU is enabled by setting bit 0 in CR0 register while paging is enabled by setting bit 31 in the same register.

struct kvm_sregs sregs;
ioctl(vcpufd, KVM_GET_SREGS, &sregs);

sregs.cr3 = 0x1000;
sregs.cr0 = sregs.cr0 | 0x80000001;

ioctl(vcpufd, KVM_SET_SREGS, &sregs);
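The constant 0x80000001 is simply bit 0 and bit 31 combined. Naming the bits, as in the sketch below, can make the intent easier to read. The macro names CR0_PE and CR0_PG follow the bit names used in the Intel manual, but the macros and the helper cr0_enable_paging are defined here only for illustration.

```c
#include <stdint.h>

#define CR0_PE (1ull << 0)    /* Protection Enable: switches the CPU to protected mode */
#define CR0_PG (1ull << 31)   /* Paging: turns on address translation, requires PE */

/* Return cr0 with protected mode and paging switched on.
 * All other bits are preserved. */
static inline uint64_t cr0_enable_paging(uint64_t cr0) {
    return cr0 | CR0_PE | CR0_PG;
}
```

Under these definitions, the assignment above could be written as sregs.cr0 = cr0_enable_paging(sregs.cr0); with the same effect.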

6. Run VM

Finally, set instruction pointer (IP/rip) to 0x4000 (location of a.bin program):

struct kvm_regs regs = {
    .rip = 0x4000,
};
ret = ioctl(vcpufd, KVM_SET_REGS, &regs);

And we are ready to kick off our VM:

...
ioctl(vcpufd, KVM_RUN, NULL);
...

And hopefully you'll be greeted with the following output:

Complete source code is available at GitLab: https://gitlab.com/mvuksano/kvm-playground/-/tree/master/04-protected-mode-with-paging. Check out the README on how to run the code if you get stuck.

Conclusion

And that's it! You have reached the end of this article. I hope that by now you have a better understanding of what it takes to configure a CPU for what most engineers take for granted.

In the upcoming articles I will talk more about compiling assembly programs for different architectures (16/32/64) and configuring CPU to work in 64 bit mode. Stay tuned.