We were using MySQL: one master for writing, and a few slaves for reading. Besides replication lag (more on that later), it worked fine. And then you scale. And then it fails! The master is a SPOF — “Single Point Of Failure”.
Whether the process failed, the VM crashed, or we had to do maintenance on the server, having only one “master” to write to meant that the service was down for minutes. And this was not acceptable.
The first thing we implemented was sharding based on the “type” of data: there was the “stats” cluster (one master + N slaves), the “calls” cluster, the “customers” cluster, and so on. This allowed easier maintenance of the servers, as we could work on one cluster without impacting the others.
But there was still a SPOF in each cluster: the master.
To get rid of the master-slave architecture where the master is a SPOF, we needed a solution where we could write on any node, failing over to another node if the previous one was not available. For reading, we needed a solution where we could run queries against “federated” data (reading from every node).
This is not as simple as it sounds. When working on distributed systems, you have to work with:
Yes, the CAP theorem.
What should happen if you update a row on a node, and at the same time you update the same row on another node? Data reconciliation is hard (and sometimes impossible).
What if you have a network failure, and the data is not consistent between the nodes? Which one do you trust?
We looked at many technologies. There was the ‘master-master’ topology. MySQL Cluster was a good candidate. There was also Galera cluster, tungsten replicator, MySQL proxy… (At the time, MySQL group replication was not production ready).
We also looked at the NoSQL world. Cassandra, MongoDB, HBase… but:
New technologies are sexy for sure. But stability and maintenance are key when you run a business.
Note: there are many solutions now, especially if you are “in the cloud” and starting from scratch. The solution we expose here works pretty well if you have an existing MySQL infrastructure, an existing code base, people that have experience with MySQL, and the need to scale without changing the whole storage layer.
There are two kinds of data we work with:
For the permanent data, we use a master-master topology. Nothing fancy here. We always write on the same master. If it fails, or if we need to do maintenance, we switch to another master.
The story here is for the usage data.
We call this the “DBRW” architecture because we split reads/writes.
Our implementation is pretty simple: when we need to insert data, we choose a write server randomly (*). If it is not available, or if the error lets us failover, we try another write server. Boom! The SPOF is gone!
Write servers contain only the data that has been written on them. Their data is replicated asynchronously on the read servers.
Read servers federate data from all the write servers, using multi-source replication. We apply the same rule for connection than with the write servers: we choose one server randomly (*), and fail over if it is not available.
(*) Our system is a bit more clever: it avoid servers that are in maintenance or marked down by the supervision, the “random” is weighted, and we always try servers in the same datacenter first.
We INSERT
and UPDATE
on write servers. We SELECT
on read servers.
Read servers receive and “merge” data from all write servers, by leveraging multi-source replication. To avoid conflicts on primary keys, we use an increment increment + offset on the write servers. For example:
This way, INSERTs are not subject to conflict.
When we need to UPDATE, we need to do it on the server that has the data. Write servers retain data for just a few days, older data is purged. Fortunately, once written, the usage data is rarely updated, or it is within hours.
Obviously, this architecture does NOT work for data that needs to be updated continuously over time!
To know which server holds the data with the primary key, all we need to do is apply a modulo of the increment, and it will give us the offset, thus the server. Example : 41 mod 20 = 1
(DBW1) or 62 mod 20 = 2
(DBW2).
We have been using this architecture for years, and it has been working great so far.
Implementing the read/write split can be tricky. In our case, most of our projects had a “datastore” layer, in which we could easily implement the split, and the sharding.
The most important part is not to forget to read your own writes when doing updates, or if/when replication lag is an issue.
Originally published at blog.callr.tech on August 29, 2018.