This article will be of particular interest to people who are curious about how powerful encryption can be implemented in high performance systems. It is also for people who just want to know a bit more about how strong encryption works and how all the pieces fit together. Introduction . — for example, at Adobe, Anthem, eBay, Equifax, Home Depot, JP Morgan Chase, Marriott, Sony, Target, Uber, VeriSign, Yahoo, the US Office of Personnel Management, and even good security companies like RSA Security — have cost billions of dollars and compromised the privacy or identity of over a billion users. Making a system secure is very difficult Data breaches of large and widely used systems Everyone says they want more secure systems, but people don’t like using systems that are slow or hard to use, and making a system more secure can add delays and reduce convenience. It is clearly bad when people avoid using a system that is secure, and instead choose a system that is less safe but faster or easier to use. Because databases store the data that we most need to protect, they are a major piece of the security puzzle. How can we add strong encryption to a database without unduly compromising speed or usability? Looking at how various companies have done this can help. Badger, a Fast Database This article examines how the team at added , an extremely fast and reliable open-source database. has over 7.3K stars on Github and is used in over 850 projects, including and Uber's . Most notable is , which stores petabytes of data using Badger. Dgraph Labs data encryption to Badger Badger IPFS Jaeger UsenetExpress Badger is a (KVDB). A KVDB stores data in associative arrays (also variously called objects, dictionaries, maps, or hash tables). key-value database KVDBs have become popular because they are one of the simplest and fastest kind of databases, and they are easy to learn because they use a data structure that is common in most programming languages. KVDBs are well suited for use in cloud computing, and Badger is particularly fast for cloud use because it is , especially SSD storage. Badger provides on such hardware, compared to other popular KVDBs. heavily optimized for typical cloud hardware great performance The challenge for the team was to figure out how to implement encryption without compromise. We knew that this should be possible, because companies were able to (eventually) figure out how to add encryption to http (creating https) without making it noticeably slower for the end user. As we shall see, Badger’s advanced implementation positively affected how encryption was added to it. Implementation of Badger Even without encryption, the implementation of Badger makes it safer and less prone to errors that could compromise security. It is written in the , which was developed by a small team at Google that included ( , , , ). The goal of Go was to improve the C language, by adding memory safety, automatic garbage collection, safe concurrency, and multiprocessing across networked computers. Badger has quickly become the most active KVDB written in Go. Go language Ken Thompson UNIX regular expressions Plan 9 UTF-8 Badger is designed to be used inside of other projects. It is , rather than as a separate service. This increases both security and speed. And as mentioned before, , so it can be or even modified to meet project needs. implemented as an embeddable library Badger is open source reviewed to find unsafe code Badger is also very reliable. It guarantees serializable snapshot isolation (SSI) by running concurrent transactions using multiversion concurrency control ( ). Badger is resilient in the face of process, filesystem, and even hardware crashes. ACID MVCC Separating Keys from Values An advanced feature of Badger’s implementation is splitting up the key-value pairs and storing the values separately from the keys. Keys are typically short strings, so storing them together in one place makes searching for a key significantly faster. Keys typically don’t change as often as values, so more extensive optimization of the key files (including the hash tables used to look up a key) can be done to improve performance even more. On the other hand, values are typically larger and change more often than keys. For example, a value can be an entire document that is updated as the document is edited. Multiple values are stored in a value log file (vlog). Associated with each key is a , consisting of the vlog file id, the offset of the value in the file, and the size of the value. Note that as an optimization, values smaller than a user specified size (default 32 bytes) are stored with their associated keys rather than in a vlog file. pointer to its value Assuming 16 bytes per key and 16 bytes per value pointer, . This is small enough to fit inside the RAM of virtually any computer, which means that often a key search can be done in high-speed memory, which can be orders of magnitude faster than searching secondary storage. a single 64MB file can store two million key-value pairs Layering Badger In addition to being used by itself as a database, Badger can be used to implement more sophisticated databases. In particular, Dgraph Labs specifically for use in its flagship product, , the first native GraphQL database. Dgraph is open source, and is highly scalable to a large number of servers either on fixed computers or in the cloud. created Badger Dgraph Layering Dgraph on top of Badger provides significant benefits, in particular . Dgraph uses Badger to store the user’s data, so Dgraph is not concerned with data storage at all (including encryption of stored data). Because Badger is a smaller, independent system, new features like encryption can be built, tested, and verified more easily and with more confidence. Badger transparently provides encryption to Dgraph, or to any other system that uses Badger to store data. separation of concerns This is similar to the , which provides high level protocols like TCP on top of lower level protocols like IP. This allows other high level protocols (e.g. UDP, DCCP, SCCP, RSVP) to be built on top of IP. When improvements are made to IP (such as IPv4 to ), higher level protocols can all take advantage of them easily. Internet Protocol Suite IPv6 Advanced Encryption Standard One of the easiest decisions to make when adding encryption to Badger was picking the highly regarded encryption algorithm, standardized by the and used by MongoDB, SQLite, and many other databases and systems. AES has become an industry standard as it is the most algorithm for encrypting data. Wide use is critical for encryption standards as they help potential security flaws get found and fixed. AES US NIST secure and widely used AES is symmetric; the same encryption key is used for both encrypting and decrypting data. Badger supports to further secure access to data. This allows Badger to be used in systems that need to meet the various data protection regulations and requirements, including GDPR, HIPAA, and PCI DSS. key rotation Key Rotation To encrypt and decrypt data requires an encryption key. AES keys come in three sizes: 128, 192, and 256 bits. Which size you use is actually , as a brute force crack of a 128-bit key would take the fastest computer in the world over a hundred-quadrillion (10 to the 17th power) years. Even if you could get 10 million supercomputers to work together to do the crack, it would still take longer than the current age of the universe. not very important Much more important is maintaining the security of the key to avoid . Obviously, encryption keys must be stored securely by the user and should not be easy to guess. However, there are other sources of key leak. For example, “side channel” attacks have that it is possible to crack even the most secure (256-bit) AES keys in a short period of time by measuring electromagnetic radiation coming from a computer, using a device that costs around $200 and can fit into a jacket pocket. Computers doing encryption must be physically secure to prevent this. key leak demonstrated Even without physical access, if the same key is used too often there are known ways that attackers can determine the value of the key by analyzing large amounts of data encrypted using that key. In order to avoid this, the key must be changed , which is referred to as However, when the key is changed, existing encrypted data must be decrypted using the old key and then re-encrypted using the new key. These computations can severely reduce performance for encrypted data. regularly key rotation. One Key to Rule Them All Instead of using the AES encryption key directly to encrypt data, Badger uses two types of keys: 1) A user-provided AES encryption key, called the , is used to encrypt data keys. 2) are generated automatically by Badger, and used to encrypt and decrypt data on disk. Each encrypted data key is stored on disk along with the encrypted data. master key Data keys By default, Badger rotates the data keys every ten days, but this can be changed by the user. The length of the master key provided by the user must be 16, 24 or 32 bytes. This determines what version of AES encryption is used (AES-128, AES-192, or AES-256 respectively).Note that If you have a password manager (such as 1Password, LastPass, etc.), you can use its built-in password generator to generate a strong encryption key. Even if you don’t have a password manager, you can use a reputable online password generator, such as to generate your master key. you should never use a predictable string as your master key. 1Password Fortunately, because the master key is used to encrypt only data keys, which are (even all together) much smaller compared to the data stored in the database, the master key does not need to be rotated as often (as data keys) to prevent key leak. Even better, when the master key is rotated, only the data keys need to be decrypted using the old master key, and then re-encrypted using the new master key. This is much faster than re-encrypting all of the data on disk. You should rotate your master key on a regular schedule. Avoiding Encrypted Duplicates When data keys are used to encrypt data stored in the database, the same data will often be encrypted multiple times before the rotation period for the data key expires. Encrypting the same data with the same data key always generates the same encrypted text to be stored in the database. This increases the ability of an attacker to reverse engineer the original plaintext data. To reduce the predictability of the original data, Badger incorporates a standard encryption technique that doesn’t use the data key directly to encrypt the data. Instead, (IV). The data key is used to encrypt the IV and then the encrypted IV is XORed with the user data, and the result is stored on disk. This means that even if the same block is encrypted multiple times, the random value of the IV ensures that the stored text will be different each time. Badger generates a random 16-byte array called an Initialization Vector In addition to reducing data predictability, the use of IVs increases performance because an XOR operation is vastly faster than running the AES encryption algorithm directly on the data to be encrypted. The overhead of this increased security and performance is the extra storage used for each 16-byte IV. Because of this, a potential issue with using IVs is that it can lead to data bloat. Reducing the Storage Overhead of IVs The files containing the keys are each divided into blocks, where each block is 4KB in size. Badger uses a unique IV to encrypt each of these blocks and stores the IV in plaintext at the end of the encrypted block. The storage overhead of a 16B IV over a 4KB block is 0.4%, which is reasonable. Note that storing the IVs in plaintext is still secure. Assume that a cracker gets access to the IV and an encrypted block. To decrypt the block, they’d need to encrypt the IV with the data key (to XOR the encrypted data back to plaintext). But to get access to the data key, they’d need to decrypt it using the master key. If master key is safe and secure, that won’t be possible, rendering the effort futile. Thus, knowing the IV is not sufficient to decrypt the data. Avoiding the bloat of attaching a 16-byte IV to every value Next, we need to encrypt the values stored in the vlog files. Each value is read individually from a vlog file, using the value pointer associated with each key, which stores the vlog file id, the offset in the file, and the size of the value. If we encrypt a whole block at a time, having multiple values in a block would cause a significant performance slowdown because considerable extra data would need to be decrypted to read an individual value. As values are often accessed individually, we decided to . Because each encrypted value is required to have its own random IV, using one IV per value adds a 16B overhead for each value. For a vlog file that contains 10,000 entries, storing an IV with each value would require 160,000 bytes, which is not good. encrypt each value individually It isn’t quite as bad as that, because small values are stored directly in the key files, instead of pointers, and that value gets decrypted along with its key. But even with this optimization, unless all values are very small it is easy to imagine that having one IV per value could double the size of a vlog file. Luckily, we came up with a better solution. To optimize the encryption of the vlog entries, . Instead of generating a 16-byte IV and storing it at the end of each value in the vlog, Badger generates a single 12-byte IV that is used for all values in a vlog file. Badger uses a unique technique Along with it, Badger attaches 4 bytes that contain the offset of the value in the vlog file, which together make up the required 16-byte IV. For decryption, the 4-byte vlog offsets are available from the (decrypted) value pointer stored with each key. The 12-byte IVs are unique for each vlog, and the 4-byte offsets are unique within a single vlog file, so each composite 16-byte IV is also unique. This technique saves 16 bytes of space on disk for every value in a vlog file. Instead, this technique only requires 12 bytes per vlog file, regardless of the number of values in a file. That’s an even better savings than for the key files. Conclusion We added encryption to Badger using established and proven standards and best practices, implemented in a way that works well with the existing features and benefits of Badger. Most importantly, the addition of encryption did not appreciably slow Badger down, make it difficult or inconvenient to use, or significantly increase the storage used. Badger is open source, and has been used as the storage engine for many large systems, and now it includes encryption for free. We hope Badger will make it easier for even more projects, especially ones that need to securely store encrypted data. For more information about Badger , see the Dgraph blog entries about it , including this one about encryption in Badger .

Adobe

Anthem

eBay

Google

Home Depot

Sony

Target

Uber

Why Flutter Uses Dart

Read my new article https://medium.com/@wmleler1/adding-encryption-to-a-fast-database-63ca1e14edc4

Read My Stories

Too Long; Didn't Read

Adding Encryption to a Fast Database, Without Compromise

Adding Encryption to a Fast Database, Without Compromise

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

The Best Display for Programming?

10 Ways to Optimize Your Database

10 Ways to Reduce Data Loss and Potential Downtime Of Your Database

10 Principles of Proper Database Benchmarking

10 Minute Guide to Fixing Damaged SQL Databases - No Recovery Required!

10 Best Online Courses to Learn Oracle and PL/SQL for Beginners

The Best Display for Programming?

10 Ways to Optimize Your Database

10 Ways to Reduce Data Loss and Potential Downtime Of Your Database

10 Principles of Proper Database Benchmarking

10 Minute Guide to Fixing Damaged SQL Databases - No Recovery Required!

10 Best Online Courses to Learn Oracle and PL/SQL for Beginners

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps