This article describes how the developers of Tarantool, an in-memory computing platform, implemented disk storage. We will look at how it fits within our in-memory technology and why we chose the LSM tree architecture.
Tarantool is a transactional persistent DBMS that stores 100% of its data in memory. The main advantages of this approach are speed and ease of use. A few years ago, we decided to expand our product with a classical storage technology used in traditional DBMSs, where only the cache is stored in RAM, while most of the data is stored on disk.
The idea to add a disk engine to our product came from the fact that the users do not access all the data equally often. For example, they usually check and change their account balance more frequently than their name. Another example would be the account balance in the current month compared to the account balance several years ago. You would check the former more often than the latter.
This article will explain why we decided to create our own disk engine and how we implemented it based on LSM trees.
Reinventing the wheel: A custom LSM-based implementation
Here is the first question we faced: Should we use an existing library or make our own custom engine? The open-source software community had ready-to-use libraries that we could work with. RocksDB was the one with the most development activity. There were also less popular libraries like WiredTiger, ForestDB, NestDB, and LMDB.
After looking at the source code of those libraries and weighing the pros and cons, we decided to write our own implementation. We found that if we embedded one of those libraries into the Tarantool architecture, our users would bear the cost of multithreading without enjoying its benefits.
Tarantool has an actor-based architecture, which means that transactions are processed in a dedicated thread. This helps avoid unnecessary locking, inter-process communication, and many other overheads that can consume up to 80% of CPU time in multithreaded DBMSs.
Designing a storage subsystem with cooperative multitasking in mind would significantly speed up the workflow and allow us to implement optimizations that are too complex for multithreaded engines. Thus, embedding a third-party solution in Tarantool wouldn't achieve the best result.
So we abandoned the idea to use an existing library. Now we had to choose an architecture for our custom implementation. There are two generations of disk storage solutions: based on B-trees and LSM trees (log-structured merge-trees), respectively. For example, MySQL, PostgreSQL, and Oracle use B-trees, while Cassandra, MongoDB, and CockroachDB use LSM trees.
Previously, it was believed that B-trees are better for reading data and LSM trees are better for writing. However, SSDs have since gained popularity, and because their read performance greatly exceeds their write performance, the advantages of LSM trees in most scenarios became clear. So we decided to go with LSM.
One of the main problems solved by LSM trees is the amount of incidental I/O during data updates. With a B-tree, a single write operation may require writing about 40 times more data to disk than the actual amount being changed.
In blogs and literature on disk storage, incidental reads are referred to as read amplification, and incidental writes as write amplification. The amplification factor is formally defined as the ratio between the actually read (or written) data and the volume required for the operation. In our B-tree example, the amplification factor is about 40 for both read and write operations.
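The amplification factor from this definition can be illustrated with a quick back-of-the-envelope calculation. The block and row sizes below are hypothetical, chosen only to match the factor of about 40 mentioned above:

```python
# Back-of-the-envelope amplification estimate with hypothetical numbers:
# a 4 KB B-tree block and a 100-byte row.
BLOCK_SIZE = 4096   # bytes transferred per block access
ROW_SIZE = 100      # bytes the operation actually needs

# Amplification factor: data actually read (or written) divided by
# the data volume the operation required.
amplification = BLOCK_SIZE / ROW_SIZE   # about 40
```

Reading one 100-byte row still costs a whole block, so both read and write amplification land near 40 in this setup.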
The crucial difference between LSM trees and the classic B-trees is that LSM stores data operations like inserts and deletes, not the data themselves (keys and values). Plus, memory and disk are strictly divided in the LSM tree structure.
Filling the LSM tree
Every LSM element contains a log sequence number (LSN). The LSN — a member of a monotonically growing sequence — uniquely identifies each operation. The whole tree is ordered by key in ascending order, and operations on the same key are ordered by LSN in descending order.
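This ordering can be sketched in a few lines of Python. The `Element` type and its fields here are hypothetical, not Vinyl's actual format; the point is the sort order:

```python
# A minimal sketch of LSM element ordering: each element records
# an operation, its key, and its LSN.
from typing import NamedTuple, Optional

class Element(NamedTuple):
    key: int
    lsn: int
    op: str                     # 'put' or 'delete'
    value: Optional[str] = None

# Keys ascend; within one key, LSNs descend, so the newest
# operation on a key always comes first.
def order(e: Element):
    return (e.key, -e.lsn)

ops = sorted([
    Element(2, 5, 'put', 'b'),
    Element(1, 7, 'delete'),
    Element(1, 3, 'put', 'a'),
], key=order)
# For key 1, the delete (LSN 7) now precedes the older put (LSN 3).
```

Because the newest operation on a key always sorts first, a search can stop at the first match for that key.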
Unlike the B-tree, which is entirely stored on the disk but can be partially cached in RAM, the LSM tree is strictly divided between the disk and RAM. The part of the tree that is located in RAM is called L0 (level zero). Because RAM is limited, L0 is stored in a dedicated memory space.
The size of L0 is controlled by the vinyl_memory parameter in the Tarantool configuration. At first, when the tree contains no elements, all the operations are written to L0. Suppose a new value is inserted at a specific key, replacing the old value. In that case, it's easy to find and delete the old value because the elements are sorted by key in ascending order and then by LSN in descending order.
Sooner or later, the tree will contain more elements than the size of L0 allows. Then all the contents of L0 will be written to a file on the disk, and L0 will be freed up for new inserts. This operation is called dumping.
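A toy model of L0 and a dump might look like the following. This is hypothetical code, not Vinyl's: L0 keeps only the newest operation per key in RAM, and a dump writes everything out sorted by key, then frees L0:

```python
# A toy L0: in-RAM operations are replaced in place, and a dump emits
# a sorted run, then frees L0 for new inserts (hypothetical code).
class L0:
    def __init__(self):
        self.ops = {}       # key -> (lsn, op, value); newest wins in RAM
        self.lsn = 0        # monotonically growing sequence

    def put(self, key, value):
        self.lsn += 1
        self.ops[key] = (self.lsn, 'put', value)    # replaces the old value

    def delete(self, key):
        self.lsn += 1
        self.ops[key] = (self.lsn, 'delete', None)  # a tombstone record

    def dump(self):
        # Emit a run: all of L0 sorted by key, then clear L0.
        run = [(k,) + self.ops[k] for k in sorted(self.ops)]
        self.ops.clear()
        return run

l0 = L0()
l0.put('b', 1); l0.put('a', 2); l0.delete('b')
run = l0.dump()   # [('a', 2, 'put', 2), ('b', 3, 'delete', None)]
```

In the real engine the run goes to a file on disk, but the invariant is the same: each dump produces a key-sorted snapshot of everything L0 held.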
All the dump files on the disk form an LSN-ordered sequence. The LSN ranges in the files do not overlap, and files with more recent operations are closer to the beginning of the sequence. Think of this as a pyramid with newer files at the top and older files at the bottom.
The pyramid gets higher with each new file. The more recent files may contain deletion or replacement operations for existing keys. The old garbage data is removed when several files are combined into one. This process is called merging or compaction. If there are two versions of the same key during the merge, only the newer version is retained. And if a key was inserted and then deleted, both operations are removed.
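A merge of this kind can be sketched as follows. This is hypothetical code — real runs are merged as sorted streams from disk, not dicts in memory — and it drops an insert/delete pair only on the last level, where nothing older can remain below:

```python
# A compaction sketch: runs are lists of (key, lsn, op, value).
def compact(runs, is_last_level=False):
    newest = {}
    for run in runs:
        for key, lsn, op, value in run:
            # Of two versions of the same key, keep only the newer one.
            if key not in newest or lsn > newest[key][0]:
                newest[key] = (lsn, op, value)
    merged = []
    for key in sorted(newest):
        lsn, op, value = newest[key]
        # At the last level nothing older remains to shadow, so a
        # put followed by a delete leaves no trace at all.
        if op == 'delete' and is_last_level:
            continue
        merged.append((key, lsn, op, value))
    return merged

newer = [('x', 4, 'delete', None), ('y', 5, 'put', 'v2')]
older = [('x', 1, 'put', 'v0'), ('y', 2, 'put', 'v1')]
compact([newer, older])                      # keeps the delete for 'x'
compact([newer, older], is_last_level=True)  # 'x' disappears entirely
```

At intermediate levels the tombstone must survive the merge, because older versions of the key may still exist in lower files — a point the section on deletions below makes precise.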
The files on the disk are organized into a pyramid so that the LSM tree is optimized for different scenarios. The newer the operation is, the higher up are the corresponding data. When it comes to merging, two or more neighboring files, preferably of similar size, are combined into one.
Neighboring files of similar size form a tree level. File sizes on different levels determine the shape of the pyramid. This allows optimizing the tree for intensive inserts or reads. Files can be merged endlessly, moving up and down the pyramid. If there are many deletions, a file that results from merging may move closer to the top because its size is smaller than the sizes of the original files compacted.
Search in the LSM tree yields not the element itself but the last operation with it. If this is a deletion, the element of interest is not in the tree. If this is an insertion, the element corresponds to the top value in the LSM pyramid, and the search can be stopped at the first key match. In the worst case, when the element was never in the tree, the search goes through all the successive tree levels, starting with L0.
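A point lookup following this rule can be sketched like so (hypothetical code; each level is modeled as a dict, newest first):

```python
# Point-lookup sketch: check L0 first, then each older level, and stop
# at the first operation found for the key.
def lookup(key, levels):
    # `levels` holds dicts of key -> (op, value), newest (L0) first.
    for level in levels:
        if key in level:
            op, value = level[key]
            # The newest operation decides: a delete means "not found".
            return None if op == 'delete' else value
    return None   # worst case: every level scanned, key never existed

levels = [
    {'a': ('delete', None)},                # L0: 'a' recently deleted
    {'a': ('put', 1), 'b': ('put', 2)},     # older run on disk
]
lookup('a', levels)   # None: the delete shadows the older put
lookup('b', levels)   # 2
lookup('c', levels)   # None, after scanning all levels
```

Note how the tombstone for `'a'` stops the search early, even though an older value still sits in a lower level.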
Why store deletions at all? Also, why doesn't it lead to tree overflow in scenarios like `for i=1,10000000 put(i) delete(i) end`?
Deletions have a unique role in the LSM tree. When a search is performed, they indicate the absence of the value, and during merges, they help clear the tree of garbage records with older LSNs.
As long as the data are only in RAM, there is no need to store deletions. Deletion operations are also removed after merges that affect the lowest level of the tree with the oldest dump data.
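This is why the `put(i) delete(i)` loop above does not overflow the tree: once a merge reaches the lowest level, both the tombstone and everything it shadows are gone. A self-contained sketch (hypothetical code) of that final merge:

```python
# Last-level merge: keep the newest operation per key, then drop
# tombstones, since nothing can hide below the last level.
def last_level_merge(ops):
    newest = {}                      # key -> (lsn, op)
    for key, lsn, op in ops:
        if key not in newest or lsn > newest[key][0]:
            newest[key] = (lsn, op)
    return sorted(k for k, (lsn, op) in newest.items() if op == 'put')

# Every key is put and then deleted, as in the loop above
# (even LSNs for puts, odd LSNs for the later deletes).
ops = [(i, 2 * i, 'put') for i in range(1000)] + \
      [(i, 2 * i + 1, 'delete') for i in range(1000)]
last_level_merge(ops)   # -> []: the tree ends up empty
```

Each key's newest operation is a delete, so after full compaction no records remain at all.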
When the value of a secondary index is changed, a deletion often immediately follows the insertion of the unique value. Deletions like these can be filtered out when intermediate levels are merged. This optimization is implemented in Vinyl.
Advantages of LSM
Thus, our process involves regular L0 dumping and L1-Lk compaction. Besides reduced write amplification, it has many more advantages over the approach used in B-trees:
- Dumping and compaction result in relatively big files. L0 is typically as large as 50-100 MB, which is thousands of times more than the size of a B-tree block.
- The large size allows the files to be efficiently compressed before writing. Tarantool compresses the files automatically, which further reduces write amplification.
- There are no fragmentation costs because elements in every file follow each other without gaps or padding.
- All the operations create new files instead of changing old data at specific locations. This allows us to get rid of nasty blocking because operations can run in parallel without conflict. It also simplifies backup and replication.
- Storing old versions of data allows implementing efficient transaction support through multi-version concurrency control.
While in-memory data processing is very fast, not every operation requires high speed. For this reason, a disk storage engine is worthwhile even in high-performance DBMSs.
In Tarantool, data are always accessed from the same thread. Existing libraries don't fit this approach well. We didn't want our users to bear the costs of multithreading without gaining any of its advantages, so we created our own engine.
Today, data storage solutions can be divided into two generations — those based on B-tree varieties and the more modern ones based on LSM trees (log-structured merge-trees). For example, MySQL, PostgreSQL, and Oracle have B-trees at their core, while Cassandra, MongoDB, and CockroachDB already use LSM trees.
It is considered that B-trees are more read-efficient while LSM trees are write-optimized. However, with SSD storage gaining traction, the strengths of LSM have become apparent because the read performance of SSD devices exceeds their write performance by several times.
That's why we created Vinyl, our LSM-based engine. In this article, we explained the algorithms behind it. This is a rather complicated topic, so don't be shy and ask your questions in the comments or in our support chat.
During our work on the second engine, Tarantool matured a lot. It is now used by Internet providers, game developers, banks, telecom, and retail companies. The stable version of Tarantool with the Vinyl engine has been successfully evaluated in production by local and global companies based in different countries. You can download it from the official Tarantool website.