How to avoid latency spikes and memory consumption spikes during snapshotting in an in-memory…

Remember my recent article “What an in-memory database is and how it persists data efficiently”?

In that article, I summarized the mechanisms that an in-memory database adopts to persist data. I was talking about two mechanisms: transaction logging and snapshotting. I covered transaction logging (only in general, though) and touched snapshotting a bit. So, in this article I’m going to elaborate on snapshotting. I’ll start with a simple, quick & dirty implementation of snapshotting in an in-memory database, point out some problems of that and then dive deeply into the implementation of snapshotting in Tarantool.

Ok. We have an in-memory database. Everything is in memory. Right? As I mentioned in my previous article, to snapshot it, we just need to write everything to a file. This means that we need to iterate through all the tables and through all the rows of all the tables and dump everything to disk in a single file via the ‘write’ syscall. Sounds simple. But the problem here is that the database is being changed. Even if we lock data structures as we go, we can end up with an inconsistent state on disk.

How to make it consistent? The easiest (and the dirtiest) way is to lock the whole database in advance, dump it to a file and unlock it. Well, this will work. A database can be locked for a long time. For example, if the size of the dataset is 256Gb, then with an HDD disk and its peak performance of 100Mb per second, the snapshotting will last for 256Gb/100Mb that equals (roughly) 2560 seconds or (roughly) 40 minutes. The database will still respond to queries but it will be all locked for update operations for 40 minutes. Dude, are you serious? (you’re saying right now). Well, 40 minutes of update downtime for a database, say, per day means 97% uptime in the very best case scenario (which, frankly, will never happen because there are numerous of other conditions to make your database go down).

Any ideas? Let’s look closer at what is happening here. We locked the whole dataset only because we had to copy it to a slow device. What if we sacrifice memory to improve the speed? I mean, what if we copy the whole dataset in a separate place in memory and then dump it to a slow disk? Well, that sounds better, but entails at least three problems (big & small):

1. We still need to lock the whole dataset. Let’s say that we’re able to copy the dataset in memory at the rate of 1Gb per second (which still sounds optimistic because in practice the rate could be 200–500Mb per second for more of less sophisticated in-memory data structures). 256Gb/1Gb == 256 == (roughly) 4 minutes. 4 minutes of update downtime per day or 99.7% uptime, which is definitely better than 97%, but still not good enough.

2. After we have the whole dataset copied to a separate buffer in RAM, we need to dump it to disk. While we’re dumping it, the original copy of a database is changing. We need to handle it somehow, e.g. to save the transaction id with the snapshot to keep track of which transaction was the last considered within the snapshot. It’s actually no big deal, but still needs to be done.

3. We double down on RAM. I mean that we really need twice as much memory as we need for just a dataset, ALWAYS. Not only during snapshotting, but ALWAYS, because you can’t just add memory to a machine only during snapshotting and then take it back.

One of the ways to fix this problem is to use use system copy-on-write mechanism provided by fork. When you fork, then a separate process is created with its own virtual address space, which has a read-only copy of the whole dataset. Read-only, because all changes happen in a parent process. So, we fork a process and then take our time to dump the dataset to disk. The only question here is — how is it different from the previous algorithm? The answer lies in the COW mechanism of Linux. COW is an acronym for “copy on write”. What does that mean? It means that a child process initially shares all the memory pages with its parent. And then, as soon as a parent or a child changes something in memory, a copy of a page is created.

Of course, this copying entails some delays in responding, because we need to copy the page and even more. The size of a page is normally 4Kb. If you, say, change a small value in a database, then a page fault exception will be raised first (because all the pages of both the parent and the child are marked as read-only after fork), then the system will switch to the kernel mode, then allocate a new page, copy 4Kb from the old page to the new one, and return back to the user mode. This is still a very simplified description of what is happening. For more details, start from here: https://en.wikipedia.org/wiki/Kernel_same-page_merging. If this change touches many pages (which is very likely for spacious data structures like trees), then the same thing will happen many times. So, you can really slow down a database due to the COW mechanism. For heavy workloads, this delay can result in bad latency spikes or even in a short downtime. Plus, for heavy workloads, there will be numerous random page updates in the parent process, which can make almost the whole database be copied, which in turn can entail a double down on RAM. So, if you’re lucky, then no latency spikes, no downtime. If not, then yes to latency spikes and yes to downtime, oh and yes to double down on RAM.

Another problem with fork is that it copies the table of page descriptors. If you use, say, 256Gb of memory then the size of this table could be hundreds of megabytes, and your process can freeze for a second or so, which entails latency spikes.

Fork is not the ideal solution. Right? But this is the best that we’ve come up with so far. Actually, some popular in-memory databases are still using fork for snapshotting. For example, Redis.

How to make a better solution? Let’s look closer at the COW mechanism. It “cows” by 4Kb. So, you change only one byte, but still “cow” the whole page (in fact, many pages when it comes to trees, even if you don’t need rebalancing). What if we just implement our own COW that will copy-on-write only the pieces of memory that were actually changed? More specifically, only the values that were changed. Of course, we won’t do that as a substitution for the system mechanism, but just for our snapshotting needs.

The main idea here is to augment all our data structures (tree, hashes, table spaces) with the ability to store many versions of each data item. That is very close to multi-version concurrency control (MVCC). The difference is that it is used here not for concurrency control in general, but just for snapshotting. When snapshotting starts, all change operations now and on create newer versions of data items, whilst all older versions are still alive and used in snapshotting. Look at these pictures. This logic applies to trees, hashes and table spaces:

You see that data items can have old versions and new versions. For example, at the last picture you see that values 3,4,5 and 8 of the table space have old versions along with new versions attached to them, while values 1,2,6,7,9 have only single versions.

Only newer versions are being changed. The older ones are used for read operations during snapshotting. The main difference between this MVCC-like COW and the system COW is that we “cow” not the whole 4Kb page, but only a small piece of data that is actually changing. Say, if you update a 4-byte integer number, then this will create a new version of this 4-byte integer and only these 4 bytes will be copied (plus some bytes as overhead for versioning). Compare this with 4096 bytes that were copied during the system copy-on-write, plus page faults and context switching (every context switching actually costs like copying 1Kb of memory or so), plus repeat all of these several times. Lots of job? Only to change a 4-byte integer during snapshotting.

We adopted this manual COW technique for snapshotting in Tarantool since 1.6.6 version (before that we used fork).

There will be more articles about that with more details and graphs from Mail.Ru Group’s production servers. Stay tuned.