1,617 reads

Principles of a Clean Relational Database

by Andrey SundukovSeptember 30th, 2022

Too Long; Didn't Read

The article is going to describe some important principles of how to design your database to be fast, flexible, and easy to use for usual business operations. It is trying to avoid some science staff like normal forms of databases but describes them in some practical examples. The main rule in designing an OLTP database is fully relying on the SRP principle. The same principle works for columns as well. It is easier to understand when you think about how to organize your entities and DAO classes. If your tables in the database are not SRP compliant you also cannot easily follow SRP on DAO.

featured image - Principles of a Clean Relational Database

The Clean Database

Firstly, this article speaks about relational databases. And secondly, about OLTP databases - Online Transaction Processing databases. It means that the article is going to describe some important principles of how to design your database to be fast, flexible, and easy to use for usual business operations. Note that business operations for an OLTP database mean that each operation usually needs just one entity. So, a clean OLTP database means that your database is correctly designed for a SOLID code.

The text briefly explains some crucial points in database design.

Single Responsibility Principle for the table

Remember that all objects in the databases should comply with the Single Responsibility Principle. It is easier to understand when you think about how to organize your entities and DAO classes. If your tables in the database are not SRP compliant you cannot easily follow SRP in DAO classes. So, the first, and probably the main, rule in designing an OLTP database is fully relying on the SRP principle.

And sometimes it is not so obvious how to prepare your tables according to SRP. Let’s check on a table of clients:

At the first glance, this looks like it is all right. But looking closer, we can see that this table consists of two different sets of fields: the first set is just for client login operations and the second set contains contact info that is usually required by another set of operations.

Going deeper - you also can have some service accounts that actually don’t need addresses or login_attempts. Also, you can add some additional fields in the future like last_activity, presige_level, and so on. And for each type of operation, you need to retrieve all data for each client even if it is a login attempt or an email search, for notification purposes. It is not just more difficult for the database but it also makes your objects heavier and codes more cumbersome.

Furthermore, some databases (for example PostgreSQL) rewrite the whole row on each update operation even if you change just a single boolean field. It means that if your table consists of 100 columns instead of 10 it would write 10 times more data on each update. And write more often than if you split the table. At the end, in the worst scenarios, the database writes dozens of times more data compared to small tables.

How to design it right? For example:

Or the client-address relation may be many-to-many. In that situation, you also can use the address table for other entities in the system and use the same functionality and UI to update the addresses of different objects. In the end, we have a different set of operations, for each set of operations we manipulate a different set of data. And it is more important that in general, our operations generate less workload on disks under the database. Also, each operation is blocked by other operations much less since from the database perspective they work with different tables.

In the end, we have a more flexible set of tables, they generate less workload on the database and give us the wriggle room to write more SRP-compliant code. One additional benefit is that our tables are just way lighter and each ALTER TABLE query in the future migrations will be executed much faster.

Single Responsibility Principle for the column

The same principle works for columns as well. Let’s look at the next table:

Sometimes developers decide to add additional functionality to tables to avoid migration. In the example above there was the table with operation types received and sent. To avoid migration someone decided to add an additional flag and put it to the field operation_type. For example b2b flag. And now we have received, sent, b2b received, and b2b sent.

From some point of view, it could be reasonable but what happens next? For example, we need to add an operation bonus. Should we add bonus” and b2b bonus”? What if a bonus can be received and also sent? Should it be 8 possible values? (2 for received and sent) * (2 for b2b and non-b2b) * (2 for bonus and non-bonus). How it should be managed in the logic?

We should never mix different functionality and flag in a single field.

NULLs

There is an important rule - avoid NULLs. Let’s look at the transaction table again:

These _on fields look like some enum. They are also only updated on status change operations and are needed in the next usual operation or status changes. If we need to add 4 more statuses? Add 4 more columns? So why do we store that data in such an important and heavy table as transaction? It could be changed to a set of 2 tables:

In that structure, we literally extract half of the data outside of the core table without losing any data. Also, we add some new opportunities to add some additional data. For example, we can add user_id and log users who change the status of a transaction.

The Clean Database

There are still other points about designing databases to optimize operations but this is a topic for another article.

Designing a clean database is a very important part of an engineer’s work. When the database is already populated with data - it is much harder to change the database structure. The worse a database is organized the harder it is to keep the design clean. So do it as early as possible.