If you’ve ever thought about trust and hope, you have probably placed more of both in a database management system than in almost anything else. And really, that is what a database is! The name conveys the whole point: a place where data is kept, whose main task is to STORE it. And the saddest thing, as always, is the moment those beliefs crash into the ruins of a downed production database.
And what do I do, you ask? Do not deploy anything on your servers, we answer. Nothing, unless it can recover itself, at least temporarily, but reliably and quickly.
In this article, I will try to describe my experience of configuring a built-to-last-a-lifetime PostgreSQL cluster inside another failover solution from Google: Kubernetes (aka k8s).
Almost any application needs a data store. This is a necessity. Making this data store resistant to faults in the network or on physical servers is the common courtesy of a good system architect. Another aspect is high availability of the service even under a large number of heavy concurrent requests, which means easy scalability when necessary.
So here are our problems to solve:
Additional points dictated by the religious beliefs of the author:
In the diagram, it will look like this:
master (primary node1)    --\
|- slave1 (node2)        ---\     / balancer \
|  |- slave2 (node3)     ----|---|            |----client
|- slave3 (node4)        ---/     \ balancer /
   |- slave4 (node5)      --/
Assume the input data:
Having some experience in solving IT problems, I decided to ask the collective mind of Google for “postgres cluster kubernetes”, which returned a pile of information junk. Then I tried “postgres cluster docker”, which was not much better, and finally “postgres cluster”, which turned up a few options I chose to work with.
What upset me was the lack of decent Docker images with descriptions of any clustering options, not to mention Kubernetes. By the way, there weren’t many options for MySQL either, but one of them is still worth mentioning: this example in the official k8s repository for Galera (a MySQL cluster).
Google made it clear that I would have to look for a solution myself and, more than that, build it by hand (sigh; at least I had a bunch of scattered articles to browse through).
Let me make it clear: all the points in this section are based on my subjective opinion, so these options might still look quite viable to you. Based on my experience, however, I had to rule them out.
Whenever somebody proposes a universal solution (for anything), I always suspect that such things are cumbersome and hard to maintain. Pgpool is no exception, because it can do almost everything:
I only found the first four points useful. Speaking of the remaining three, these are the problems they present:
Actually, after reading about it (www.slony.info) and comparing it with Streaming Replication, which is well-known and works out of the box, I found it easy to decide not to even think about the elephants.
On top of that, on the very first page of the website the guys state that “PostgreSQL 9.0 includes streaming replication, which, for a number of use cases, is likely to be simpler and more convenient than Slony-I”, going on to say, “There are, however, three characteristic kinds of cases where you’ll need something like Slony-I”, which are the following:
To put it simply, I don’t think Slony is good enough, unless you need to take care of those three specific tasks.
After sorting through the options for ideal bidirectional replication, it became clear to me that the effort it takes to implement such replication is simply incompatible with functions that are crucial to some applications. Speed aside, there are limitations in handling transactions and complex queries (SELECT FOR UPDATE, etc.).
Perhaps I’m not that well versed in the matter, but the drawbacks I saw in this solution were enough to drop the idea. Besides, after putting my thinking cap on, I figured out that a system with a heavy write load needs a completely different technology than a relational database.
In the examples below you will see what a solution should ideally look like, while the code shows my own implementation. To create the cluster, you don’t necessarily need Kubernetes (there’s a docker-compose example) or even Docker at all. In that case, this post will be useful not as a CPM (copy-paste-modify) solution, but as an installation manual with snippets.
So why did our colleagues from PostgreSQL decline to use the terms “master” and “slave”? Hmm, I could be wrong, but rumor has it that it’s a matter of political correctness; we all know how that goes. Well, fine.
The first thing you need to do is start the primary server, followed by the first standby layer, then the second standby layer, all according to the task at hand. Hence we get a simple procedure for starting a typical PostgreSQL server in primary/standby mode with a configuration that enables Streaming Replication.
What you should pay attention to in the configuration file:
wal_level = hot_standby
max_wal_senders = 5
wal_keep_segments = 5001
hot_standby = on
All parameters have short descriptions in the comments, but all you really need to know is that this configuration lets clients read the WAL stream and allows queries to be executed during recovery. You can find an interesting article about it on the PostgreSQL Wiki.
As soon as the primary server of the cluster starts up, we move on to start the standby one, which “knows” where its primary is located.
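To make this concrete, here is a minimal sketch of bringing up such a standby against a 9.x primary; the host name, replication user, password and paths are placeholders rather than the exact values used in my scripts:

# 1. Take a base backup of the primary into an empty data directory.
pg_basebackup -h pgmaster -p 5432 -U replication_user -D /var/lib/postgresql/data -X stream

# 2. Tell the standby where its primary is via recovery.conf.
cat > /var/lib/postgresql/data/recovery.conf <<EOF
standby_mode = 'on'
primary_conninfo = 'host=pgmaster port=5432 user=replication_user password=secret'
EOF

# 3. Start the server; it comes up as a hot standby.
pg_ctl -D /var/lib/postgresql/data start

# 4. Back on the primary, verify that streaming is active.
psql -U postgres -c "SELECT client_addr, state FROM pg_stat_replication;"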
My goal here boiled down to building a universal Docker image that works differently depending on its mode, namely in the following way:
The sequence of all these operations is of utmost importance, which is why the dockerize project helps a lot here.
The difference between the first and the second standby layers is that ANY server from the first layer can act as the master for a second-layer server. Do not forget that the second layer should start after the first, with a time delay.
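To illustrate the idea, here is roughly how such an entrypoint could branch on its mode; the NODE_MODE and REPLICATE_FROM variables and the setup_standby.sh helper are made up for this sketch, while the real image uses its own variables and scripts:

#!/bin/bash
set -e

case "$NODE_MODE" in
  primary)
    # A primary initialises and starts straight away.
    exec docker-entrypoint.sh postgres
    ;;
  standby)
    # A standby waits for its upstream node, clones it, then starts.
    dockerize -wait tcp://$REPLICATE_FROM:5432 -timeout 60s
    setup_standby.sh "$REPLICATE_FROM"   # hypothetical helper: base backup + recovery.conf
    exec docker-entrypoint.sh postgres
    ;;
esac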
Split brain is a situation in which different segments of the cluster each create or elect their own new master and believe that the problem is solved.
This is one but not the only problem that Repmgr helped me to solve.
Essentially, this is a manager that can:
What comes to the rescue in our case is repmgrd, which is started by the primary process to monitor the integrity of the cluster. In situations where you have lost access to the master, Repmgr attempts to parse the current structure of the cluster and determine which node should become the next master. Certainly, Repmgr is intelligent enough to elect the one correct master and avoid creating a split-brain situation.
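For reference, this is roughly how repmgr glues the nodes together. The commands and parameters below are for repmgr 3.x (newer versions rename some of them, for example node_id and primary register), and the host and node names are placeholders:

# Each node gets its own repmgr.conf.
cat > /etc/repmgr.conf <<EOF
cluster=pg_cluster
node=2
node_name=node2
conninfo='host=node2 user=repmgr dbname=repmgr'
EOF

# The primary registers itself; each standby clones from its upstream and registers too.
repmgr -f /etc/repmgr.conf master register
repmgr -f /etc/repmgr.conf standby clone -h pgmaster -U repmgr -d repmgr
repmgr -f /etc/repmgr.conf standby register

# repmgrd then watches the cluster and promotes a standby if the master disappears.
repmgrd -f /etc/repmgr.conf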
The last component of the system is Pgpool. As I stated in the section about bad decisions, the service still does its job:
To conclude, I have created quite a simple Docker image which, at startup, configures itself to work with a set of nodes and a set of users who can authenticate through Pgpool using MD5 authentication (this bit turned out to be more complicated than I expected).
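The essential bits the image generates look roughly like this; the host names, user and password are placeholders, and in the real image they come from environment variables:

# pgpool.conf: one backend entry per PostgreSQL node, plus md5 auth via pool_hba.
cat >> /etc/pgpool2/pgpool.conf <<EOF
backend_hostname0 = 'node1'
backend_port0 = 5432
backend_weight0 = 1
backend_hostname1 = 'node2'
backend_port1 = 5432
backend_weight1 = 1
enable_pool_hba = on
pool_passwd = 'pool_passwd'
EOF

# pool_hba.conf: require md5 for client connections.
echo 'host all all 0.0.0.0/0 md5' >> /etc/pgpool2/pool_hba.conf

# pool_passwd: one md5 entry per application user, generated with pg_md5.
pg_md5 --md5auth --username=app_user 'app_password'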
Far too often, getting rid of a single point of failure is all it takes, and in our situation that point is the pgpool service, which proxies all requests and can therefore become the weakest link on the path to the data. Fortunately, k8s comes in handy here, allowing us to create as many replicas of the service as needed.
Sadly, this Kubernetes example is missing the part about creating replicas, but if you know how a Replication Controller and/or a Deployment works, it will be no trouble to pull the configuration together.
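Assuming pgpool lives in a Deployment (or an old-style Replication Controller) with a Service in front of it, scaling it out is a one-liner; the object and label names here are illustrative:

# Scale the pgpool Deployment to three replicas.
kubectl scale deployment pgpool --replicas=3

# The same works for a Replication Controller.
kubectl scale rc pgpool --replicas=3

# The Service in front of the pods spreads client connections across all healthy replicas.
kubectl get pods -l app=pgpool -o wide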
This article is not a retelling of the scripts that solve the problem; rather, it is a description of the structure of the solution. That means that to better understand and adapt the solution, you will have to read the code, or at least the README.md on GitHub, which explains step by step and in detail how to start the cluster with docker-compose and with Kubernetes. In addition, for those who appreciate what I wrote and decide to move beyond the proposed solution, I am willing to lend a virtual helping hand.
I hope you have found the article helpful. Have a positive summer everyone! Good luck and good mood, colleagues ;)
The original article was written in Russian, so thanks to Sergey Rodin (Technical Writer, Lazada) and Tom Wickings (Technical Project Manager, Infrastructure Department, Lazada) for helping to translate it!