In October, I’ll be in New York at the O’Reilly Velocity Conference, giving a talk called “What We Talk About When We Talk About On Disk IO”. I’ve decided to release some of my preparation notes as a series of blog posts.
Knowing how IO works, which algorithms are used and under which circumstances, can make the lives of developers and operators much better: they will be able to make better choices upfront (based on what is in use by the database they’re evaluating), troubleshoot performance issues when the database misbehaves (by comparing their workloads to the ones the database stack is intended to be used against) and tune their stack (by spreading the load, switching to a different disk type, filesystem or operating system, or simply picking a different index type).
While Network IO is frequently discussed and talked about, Filesystem IO gets much less attention. Of course, in modern systems people mostly use databases as their storage layer, so applications communicate with them through drivers over the network. Still, I believe it is important to understand how the data is written onto the disk and read back from it. Moreover, Network IO offers many more implementation choices, which differ greatly from one operating system to another, while Filesystem IO has a much smaller set of tools.
There are several “flavours” of IO (some functions omitted for brevity):
Today, we’ll discuss Standard IO combined with a series of “userland” optimisations. Most of the time, application developers use it plus a couple of different flags on top. Let’s start with that.
There’s a bit of confusion around the term “buffering” when talking about the stdio.h functions, since they do some buffering themselves. When using Standard IO, it is possible to choose between full and line buffering, or to opt out of buffering altogether. This “user space” buffering has nothing to do with the buffering that will be done by the Kernel further down the line. You can also think of it as a distinction between “buffering” and “caching”, which should make the two concepts distinct and intuitive.
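For illustration, here is a minimal sketch of switching between the three stdio buffering modes with setvbuf (the file name and buffer size are arbitrary):

```c
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    FILE *f = fopen("example.log", "w");  /* hypothetical file name */
    if (f == NULL) {
        perror("fopen");
        return EXIT_FAILURE;
    }

    /* Full buffering: writes are collected in a 64 KiB user-space buffer
       and handed to the kernel only when the buffer fills (or on fflush/fclose). */
    setvbuf(f, NULL, _IOFBF, 64 * 1024);

    /* Alternatives:
       setvbuf(f, NULL, _IOLBF, 0);  - line buffering, flush on every '\n'
       setvbuf(f, NULL, _IONBF, 0);  - no stdio buffering, every fwrite issues a write(2) */

    fprintf(f, "buffered in user space, not yet visible to the kernel\n");

    /* fflush pushes the stdio buffer down to the kernel;
       it does NOT guarantee that the data has reached the disk. */
    fflush(f);

    fclose(f);
    return EXIT_SUCCESS;
}
```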
Disks (HDDs, SSDs) are called Block Devices, and the smallest addressable unit on them is called a sector: it is not possible to transfer an amount of data smaller than the sector size. Similarly, the smallest addressable unit of the file system is a block (which is generally larger than a sector). The block size is usually smaller than (or the same as) the page size (a concept coming from Virtual Memory).
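To see what the filesystem considers a good IO granularity for a given file, you can ask stat for st_blksize (on Linux, the logical sector size of a block device can additionally be queried with the BLKSSZGET ioctl); a small sketch:

```c
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    struct stat st;
    if (stat(argv[1], &st) != 0) {
        perror("stat");
        return 1;
    }

    /* st_blksize is the filesystem's preferred block size for IO on this file;
       issuing requests in multiples of it avoids partial-block work in the kernel. */
    printf("preferred IO block size: %ld bytes\n", (long)st.st_blksize);
    return 0;
}
```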
Everything that we’re trying to address on disk ends up being loaded into RAM, and most likely cached by the Operating System for us in between.
The Page Cache (the Buffer Cache and the Page Cache, previously entirely separate, got unified in the 2.4 Linux kernel) helps to keep cached the pages that are most likely to be accessed in the near future. The temporal locality principle implies that pages which were just read will be accessed multiple times within a small period of time, and spatial locality implies that related elements have a good chance of being located close to each other, so it makes sense to keep the data around to amortise some of the IO costs. In order to improve IO performance, the Kernel also buffers data internally by delaying writes and coalescing adjacent reads.
The Page Cache does not necessarily hold whole files (although that certainly can happen); depending on the file size and the access pattern, it may hold only the chunks that were accessed recently. Since all IO operations happen through the Cache, sequences of operations such as read-write-read can be served entirely from memory, without accessing the (meanwhile outdated) data on disk.
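One way an application can cooperate with the Page Cache is to hint its expected access pattern, so the kernel knows whether aggressive readahead is worthwhile. A sketch using posix_fadvise (the file name is made up for illustration):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.db", O_RDONLY);   /* hypothetical data file */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Tell the kernel we intend to read the whole file sequentially,
       so it can schedule more aggressive readahead into the Page Cache... */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    /* ...or that the access will be random, so readahead is not useful:
       posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM); */

    /* read(2) calls issued from here on are still served through the Page Cache,
       which now has a better idea of what to prefetch and keep. */

    close(fd);
    return 0;
}
```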
When a read operation is performed, the Page Cache is consulted first. If the data is already located in the Page Cache, it is copied out for the user. Otherwise, it is loaded from disk and stored in the Page Cache for further accesses. When a write operation is performed, the page is written to the Cache first and marked as dirty there.
Pages that were marked dirty (since their cached representation is now different from the persisted one) will eventually be flushed to disk. This process is called writeback. Of course, writeback has its own potential drawbacks, such as queuing up too many IO requests, so it’s worth understanding the thresholds and ratios that govern it and checking queue depths to make sure you can avoid throttling and high latencies.
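On Linux, those thresholds are exposed as sysctls under /proc/sys/vm. A small sketch that just prints the current values (tuning itself is normally done with sysctl(8)):

```c
#include <stdio.h>

int main(void) {
    const char *knobs[] = {
        "/proc/sys/vm/dirty_background_ratio",   /* % of memory dirty before background writeback starts */
        "/proc/sys/vm/dirty_ratio",              /* % of memory dirty before writers are forced to flush */
        "/proc/sys/vm/dirty_expire_centisecs",   /* how old dirty data may get before it must be written */
        "/proc/sys/vm/dirty_writeback_centisecs" /* how often the flusher threads wake up */
    };

    for (size_t i = 0; i < sizeof(knobs) / sizeof(knobs[0]); i++) {
        FILE *f = fopen(knobs[i], "r");
        if (f == NULL)
            continue;
        char value[64];
        if (fgets(value, sizeof(value), f) != NULL)
            printf("%-42s %s", knobs[i], value);
        fclose(f);
    }
    return 0;
}
```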
When performing a write that goes through a kernel and/or library buffer, it is important to make sure that the data actually reaches the disk, since it might still be buffered or cached somewhere along the way. Errors may only surface when the data is flushed to disk, which can happen while syncing or closing the file.
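A minimal sketch of such a “durable write” (the file name and payload are hypothetical): the write call may succeed while the data is still only in the Page Cache, so both durability and deferred errors have to be checked at fsync (and close) time.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    const char *payload = "important record\n";
    int fd = open("journal.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    ssize_t written = write(fd, payload, strlen(payload));
    if (written < 0) {
        perror("write");          /* failed before even reaching the cache */
        close(fd);
        return 1;
    }

    /* fsync flushes the dirty pages (and metadata) for this file to disk.
       An IO error that happened during writeback surfaces here, not at write(2). */
    if (fsync(fd) != 0) {
        perror("fsync");
        close(fd);
        return 1;
    }

    if (close(fd) != 0)
        perror("close");          /* close can also report deferred write errors */

    return 0;
}
```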
O_DIRECT is a flag that can be passed when opening a file. It instructs the Operating System to bypass the Page Cache. This means that, for a “traditional” application, using Direct IO will most likely cause a performance degradation rather than a speedup.
Using Direct IO is often frowned upon by the Kernel developers; it goes so far that the Linux man page quotes Linus Torvalds: “The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid”.
However, databases such as PostgreSQL and MySQL use Direct IO for a reason. It gives developers more fine-grained control over the data access patterns, possibly using a custom IO Scheduler and an application-specific Buffer Cache. For example, PostgreSQL uses Direct IO for the WAL (write-ahead log), since it has to perform writes as fast as possible while ensuring their durability; it can afford this optimisation because it knows the data won’t be immediately reused, so writing it past the Kernel Page Cache won’t result in a performance degradation.
A direct read fetches the data straight from the disk, even if it was recently accessed and might be sitting in the cache; this helps to avoid creating an extra copy of the data. The same is true for writes: the write is performed directly from the user space buffers.
Because DMA (direct memory access) makes requests straight to the backing store, bypassing the intermediate Kernel buffers, all operations are required to be sector-aligned (aligned to the 512-byte boundary). In other words, every operation has to start at an offset that is a multiple of 512, and the buffer size has to be a multiple of 512 as well.
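A minimal Direct IO sketch, assuming a 512-byte sector size (real code should query the device rather than hard-coding it): both the buffer address (via posix_memalign) and the transfer size and offset are kept aligned.

```c
#define _GNU_SOURCE               /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define ALIGNMENT 512             /* sector size assumed for this sketch */

int main(void) {
    /* Both the user buffer and the IO size (and file offset) must be aligned. */
    void *buf = NULL;
    if (posix_memalign(&buf, ALIGNMENT, ALIGNMENT) != 0) {
        perror("posix_memalign");
        return 1;
    }
    memset(buf, 'x', ALIGNMENT);

    /* "direct.dat" is a hypothetical file; O_DIRECT makes reads and writes
       bypass the Page Cache and go straight between this buffer and the disk. */
    int fd = open("direct.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open");
        free(buf);
        return 1;
    }

    /* Write exactly one aligned block at an aligned offset (0). */
    if (pwrite(fd, buf, ALIGNMENT, 0) != ALIGNMENT)
        perror("pwrite");

    close(fd);
    free(buf);
    return 0;
}
```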
For example, RocksDB makes sure that operations are block-aligned by checking it upfront (older versions allowed unaligned access by performing the alignment in the background).
Whether or not the O_DIRECT flag is used, it is always a good idea to make sure your reads and writes are block-aligned: making an unaligned access causes multiple sectors to be loaded from (or written back to) the disk.
Using the block size, or a value that fits neatly inside a block, guarantees block-aligned IO requests and prevents extraneous work inside the kernel.
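A tiny sketch of the usual rounding helpers, assuming a 4096-byte block size (in real code you would take it from st_blksize): the offset is rounded down and the end rounded up, so the resulting range stays block-aligned on both ends.

```c
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 4096ULL   /* assumed; must be a power of two for the masks below */

/* Round an offset down and an end position up so the resulting range
   covers the requested bytes while staying block-aligned on both ends. */
static uint64_t align_down(uint64_t x) { return x & ~(BLOCK_SIZE - 1); }
static uint64_t align_up(uint64_t x)   { return (x + BLOCK_SIZE - 1) & ~(BLOCK_SIZE - 1); }

int main(void) {
    uint64_t offset = 5000, length = 3000;          /* an unaligned request */

    uint64_t start = align_down(offset);
    uint64_t end   = align_up(offset + length);

    printf("request : offset=%llu len=%llu\n",
           (unsigned long long)offset, (unsigned long long)length);
    printf("aligned : offset=%llu len=%llu\n",
           (unsigned long long)start, (unsigned long long)(end - start));
    return 0;
}
```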
I’m adding this part here since I very often hear “nonblocking” in the context of Filesystem IO. That’s understandable, since most of the programming interface for Network and Filesystem IO is the same, but it’s worth mentioning that for regular files there is no true “nonblocking” IO in the same sense.
O_NONBLOCK is generally ignored for regular (on-disk) files, because block device operations are considered non-blocking (unlike sockets, for example): Filesystem IO delays are simply not taken into account by the interface. Possibly this decision was made because there’s a more or less hard time bound on when the data will arrive.
For the same reason, the interfaces you would usually use for readiness notification, such as select and epoll, do not allow monitoring or checking the status of regular files.
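For example, on Linux an attempt to register a regular file with epoll is simply refused (the sketch below uses a made-up file name; the epoll_ctl call typically fails with EPERM):

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

int main(void) {
    int fd = open("some.dat", O_RDONLY | O_CREAT, 0644);   /* hypothetical regular file */
    int ep = epoll_create1(0);
    if (fd < 0 || ep < 0) {
        perror("open/epoll_create1");
        return 1;
    }

    struct epoll_event ev;
    ev.events = EPOLLIN;
    ev.data.fd = fd;

    /* epoll refuses to watch a regular file: the kernel considers it
       always "ready", so there is nothing to wait for. */
    if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) != 0)
        fprintf(stderr, "epoll_ctl: %s\n", strerror(errno));

    close(ep);
    close(fd);
    return 0;
}
```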
It is hard to find an optimal post size given there’s so much material to cover, but it felt about right to make a clear cut after Standard IO before moving on to mmap and vectored IO.
If you find anything to add, or there’s an error in my post, do not hesitate to contact me; I’ll be happy to make the corresponding updates.