This article describes methods for verifying the internal state of a database, which helps ensure data consistency after a migration and after each restart.
A modern database is the core component of most software systems that process large amounts of data. That is why it is crucial to know that the data keeps a predictable shape during regular processing and after non-standard situations.
A database can be killed in the middle of any query and then restarted.
But how can we be sure that the data is still consistent?
How can we be sure of data consistency after a sufficiently complex migration?
How can we be sure that the data has not been damaged?
Let’s talk about possible ways to answer these questions.
If we start with a single data blob, the problem becomes easier to understand. A blob is just a finite sequence of bytes on some physical storage. To verify that it has not changed, we can use well-known hashing algorithms over its body and size. Hashing maps a large data blob to a small, practically unique result.
In picture 1, we can see that the size needs to be included in the hash computation. This guarantees that hashing a different number of bytes will not produce the same result as the current data blob.
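As a minimal sketch, a blob hash that includes the size could look like the Python snippet below; SHA-256 and the 8-byte big-endian size prefix are illustrative choices, not requirements from the article.

```python
import hashlib

def blob_hash(blob: bytes) -> str:
    """Hash a blob together with its size, so that blobs of different
    lengths cannot produce the same digest by an accident of framing."""
    h = hashlib.sha256()
    h.update(len(blob).to_bytes(8, "big"))  # size prefix
    h.update(blob)                          # blob body
    return h.hexdigest()
```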
At the same time, if we can compute the hash for one blob, we can compute it for any number of blobs in the same way. For the whole set of blobs, we can then compute an aggregating hash, as shown in picture 2.
In this picture, we compute a hash of hashes. At the end of the process, we get a final aggregated hash value that represents the state of the whole collection of data blobs.
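A hash of hashes can be sketched by feeding each per-blob hash into one more hash computation; this reuses the illustrative blob_hash helper from the previous snippet.

```python
import hashlib

def aggregated_hash(blobs: list[bytes]) -> str:
    """Compute a hash of hashes over a collection of blobs, in order."""
    h = hashlib.sha256()
    for blob in blobs:
        h.update(bytes.fromhex(blob_hash(blob)))  # append each blob's own hash
    return h.hexdigest()
```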
This property of the aggregating hash is very useful for capturing the state of a database. Databases store data as a set of pages for easy interaction and management, and these pages can serve as the blobs covered by the aggregated hashing process that defines the database state.
But what should we do with hashing when the database content changes? To understand how to keep the hash computation continuous, we need to define the properties that can change:
database structure (schema)
database page collection (order and count)
database page internals (the data stored inside the pages)
In all of these cases, the aggregated hash can be recomputed quickly if we track the changes in a hash tree.
Let’s look at schema mutation. A schema can be represented as a set of SQL statements sorted hierarchically (the sort keys are database, table, column). With no changes, the schema is represented in exactly one way, which is why its hash stays equal.
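Assuming the schema has already been flattened into (database, table, column, type) tuples (a hypothetical shape, chosen here only for illustration), a canonical schema hash might be sketched like this:

```python
import hashlib

def schema_hash(columns: list[tuple[str, str, str, str]]) -> str:
    """Hash a schema given as (database, table, column, type) tuples.
    The hierarchical sort makes the textual representation unique, so an
    unchanged schema always produces the same hash."""
    h = hashlib.sha256()
    for db, table, column, col_type in sorted(columns):
        h.update(f"{db}.{table}.{column}:{col_type}\n".encode("utf-8"))
    return h.hexdigest()
```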
When only the order of the page collection changes, the content (and therefore the hash) of each page is still unchanged. Sorting the pages by their hashes and aggregating the sorted set solves this case.
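A sketch of this order-independent variant, again reusing the illustrative blob_hash helper: the per-page hashes are sorted before aggregation, so reordering the pages leaves the result unchanged, while adding, removing, or editing a page changes it.

```python
import hashlib

def page_collection_hash(pages: list[bytes]) -> str:
    """Aggregate per-page hashes in sorted order."""
    page_hashes = sorted(blob_hash(page) for page in pages)
    h = hashlib.sha256()
    for page_hash in page_hashes:
        h.update(bytes.fromhex(page_hash))
    return h.hexdigest()
```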
At the same time, we need to keep in mind that physical pages and their in-memory representation can have different footprints, which can break the aggregating hash computation.
In this picture, the aggregating hash computed over the physical device's page collection differs from the one computed over the in-memory representation. That is why it is important to compute the aggregating hash over the pages as they are represented on the physical device; those pages are expected to be identical on any platform.
If the number of pages grows, the collection changes and additional hashes take part in the aggregated hash computation. The aggregating hash value changes accordingly, showing that the database state is no longer equal to the previous one.
Now let’s consider the case where page data itself is modified. Some pages may be changed while others are not.
Each change to the data in memory can be detected, and a follow-up task can recompute the hash for the specific page. In this case, recomputing the total hash from scratch is no longer the only option.
It can be wise to collect all of the change flags in a dedicated table and recompute the hashes in a delayed task. This option can be tuned to reach the right balance between CPU/memory utilization and the response time of aggregated hash requests.
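One possible shape for this bookkeeping is sketched below; the in-memory dirty set stands in for the dedicated table of change flags, and read_page is an assumed callable that returns the raw bytes of a page. It again relies on the blob_hash helper from the first snippet.

```python
import hashlib

class PageHashTracker:
    """Keep per-page hashes, mark pages dirty on change, and recompute
    only the dirty ones in a delayed task."""

    def __init__(self) -> None:
        self.page_hashes: dict[int, str] = {}  # page_id -> last known hash
        self.dirty: set[int] = set()           # change flags

    def mark_dirty(self, page_id: int) -> None:
        self.dirty.add(page_id)

    def refresh(self, read_page) -> None:
        """Delayed task: recompute hashes only for pages flagged as dirty."""
        for page_id in self.dirty:
            self.page_hashes[page_id] = blob_hash(read_page(page_id))
        self.dirty.clear()

    def aggregated(self) -> str:
        """Order-independent aggregate over all tracked page hashes."""
        h = hashlib.sha256()
        for page_hash in sorted(self.page_hashes.values()):
            h.update(bytes.fromhex(page_hash))
        return h.hexdigest()
```

How often refresh runs is exactly the CPU/memory versus response-time trade-off mentioned above.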
Until now, we have been treating the pages as raw bytes of data, which might be incorrect for some data types:
integer values can have a different number of bits on different hardware
strings can have equal values but different memory footprints
values with finite precision (float/double/big decimal) can have different memory footprints and precision from database to database, depending on the settings
Even these examples show that taking column values straight from memory and hashing them in their natural form may be incorrect for aggregating hash computation. This is why all data taking part in the hash computation has to be converted into an intermediate form to make the result predictable. The most practical way is a text representation with well-defined formatting settings.
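A hedged sketch of such a canonical text form is shown below; the fixed ten-digit precision, the NULL/boolean spellings, and the unit-separator delimiter are illustrative settings rather than values prescribed by the article, and it reuses the blob_hash helper from the first snippet.

```python
from decimal import Decimal

def canonical_text(value) -> str:
    """Render a column value as text using fixed, platform-independent rules,
    so the hash does not depend on the in-memory representation."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):               # check before int: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)                     # no fixed bit width in the text form
    if isinstance(value, (float, Decimal)):
        return f"{Decimal(value):.10f}"       # fixed, assumed precision setting
    if isinstance(value, str):
        return value
    raise TypeError(f"no canonical form defined for {type(value)!r}")

def row_hash(row: list) -> str:
    """Hash one row of column values through their canonical text form."""
    joined = "\x1f".join(canonical_text(v) for v in row)  # unit separator between columns
    return blob_hash(joined.encode("utf-8"))
```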
Next, some actor computes the hash of each data page representation. This actor gets predictable results for the database state before a migration, after a migration, or after any possible query.
The ideal result is equality of the two hashes, but in real life they can differ. That is why storing the full Merkle tree can be valuable: it makes it possible to pinpoint exactly which data pages do not match. It is also possible to keep a Merkle tree per column to detect more specific changes.
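A minimal sketch of building the tree levels from per-page hashes and locating the pages that differ; duplicating the last hash when a level has an odd number of nodes is an assumption of this sketch, not something specified in the article.

```python
import hashlib

def merkle_levels(leaf_hashes: list[str]) -> list[list[str]]:
    """Build every level of a Merkle tree from the per-page (leaf) hashes."""
    levels = [leaf_hashes]
    while len(levels[-1]) > 1:
        prev, nxt = levels[-1], []
        for i in range(0, len(prev), 2):
            right = prev[i + 1] if i + 1 < len(prev) else prev[i]  # duplicate the odd node
            nxt.append(hashlib.sha256((prev[i] + right).encode()).hexdigest())
        levels.append(nxt)
    return levels

def mismatching_pages(before_leaves: list[str], after_leaves: list[str]) -> list[int]:
    """Report the indices of pages whose hashes differ between two snapshots."""
    return [i for i, (a, b) in enumerate(zip(before_leaves, after_leaves)) if a != b]
```

With both full trees stored, the comparison can also start at the roots and descend only into subtrees whose hashes differ, avoiding a scan over every page.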
All of the hashing methods described in this article make it easier to answer the questions raised at the beginning.
This approach makes verifying migrations and other changes close to trivial. It saves a lot of time and gives stakeholders confidence in the quality and duration of the changes.