Losing a server without prior warning is no longer a question of if, but of when. So how do you make sure MongoDB high availability is not at stake?
This article provides a solution that automatically spins up a replacement server when one of the servers in your replica set suffers an unexpected failure. This approach helps reduce the overall time your replica set is exposed to a fault tolerance value of 0.
The intended readers of this article are Operations and DevOps teams with basic knowledge of AWS products, MongoDB enthusiasts, and technologists who want to know anything and everything. The technologies used in this article are MongoDB, AWS EC2, Lambda, Route 53, DynamoDB, CloudWatch, VPC, Auto Scaling, Ansible, IAM and Python.
In an unexpected server failure situation, the fault tolerance of the replica set may drop to 0. In other words, the replica set is prone to having "no primary" should one more server fail. This article helps you reduce the time your replica set spends in that vulnerable state.
Although the article discusses the solution for Amazon Web Services, you could apply the same ideas to other platforms such as Google Cloud Platform, Pivotal Cloud Foundry or Kubernetes.
The article outlines the following topics in detail
Considering that this article gets deep into the details, I would like to give an executive brief on 'How it's done?'. Unlike a stateless web server, a MongoDB database server is all about state. So, I have used AWS Auto Scaling to provision a replacement EC2 instance automatically, reattached the dead server's data volumes to avoid a full initial sync, and used a CloudWatch-triggered Lambda function to repoint the Route 53 DNS entry at the new instance so that the replica set configuration never has to change.
Hopefully that's a short and sweet summary!
I would like to thank Alex Komyagin, Gupta Garuda and Anant Srivastava for their valuable time in reviewing this article.
The solution discussed in this article will not help you attain higher fault tolerance. It primarily helps reduce the duration your replica set is exposed to a fault tolerance value of 0. I want you to be aware that neither this article nor the solution discussed here is officially backed/supported by MongoDB Inc. So, feel free to use it at your own risk!
The source code is available at repo: https://github.com/sarjarapu/whitepapers/tree/master/mongodb/respawn/code.
Being a consulting engineer in the field, I work with different client-side teams consisting of operations, DBAs, architects and developers. Many of these clients host their MongoDB database servers on cloud providers such as Amazon Web Services, Google Cloud Platform etc. Some of the larger clients with a big footprint in AWS, or who have been using AWS for a few years, have come across issues like the ones below.
One of the most interesting and challenging questions that surfaced while working with such clients was
“Losing an EC2 instance without prior warnings is no longer a question of if, but rather when it does happen — what actions can we take to ensure MongoDB is highly available?”
Like any responsible operations team, these teams are highly concerned about worst-case scenarios and want to be proactive by having contingency plans to act upon. With one of the replica set members facing an unexpected server failure, the fault tolerance of the replica set could be at 0. In such scenarios your replica set is in a vulnerable state, as it cannot sustain an additional system failure and remain highly available.
To restore the fault tolerance back to normal, the process typically requires
I would like to highlight that this approach requires manual intervention from an operations team and consumes a lot of time. Apart from the tasks involved in re-spawning a new server, the operations team has to be highly vigilant of the overall system health, ensuring no additional failures happen while the fault tolerance of the replica set is at 0. This article offers a solution which may help you sleep better through the night in spite of server failures. So, let's get started!
MongoDB offers redundancy and high availability of the database via replication. Whenever a member in a replica set goes down, the fault tolerance count of the MongoDB replica set is reduced by 1.
If a majority of the replica set members are inaccessible or unavailable, then you cannot elect a primary member and the replica set will not be able to accept writes. So it is crucial to have a primary member, or enough members to elect a primary. Most importantly, reduce the time you are exposed to the vulnerable state of losing the primary with the loss of one more member.
If you are scared to death of losing yet another member and being left without a primary, then you need higher fault tolerance than you currently have. Referring to the number-of-members vs. fault-tolerance matrix, to attain peace of mind with a fault tolerance of 2 you would need a replica set of at least 5 members. However, if you cannot afford a 5-member replica set, then this article may help you take steps to reduce the time period during which the fault tolerance of your replica set is 0.
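To see why five members are needed for a fault tolerance of 2, a few lines of Python (my own illustration, not part of the article's code) reproduce the members vs. fault tolerance matrix:

```python
# Fault tolerance = (number of voting members) - (majority needed to elect a primary).
# Majority for N voting members is floor(N/2) + 1.

def fault_tolerance(members: int) -> int:
    majority = members // 2 + 1
    return members - majority

for n in (3, 4, 5, 6, 7):
    print(f"{n} members -> fault tolerance {fault_tolerance(n)}")
# 3 -> 1, 4 -> 1, 5 -> 2, 6 -> 2, 7 -> 3
```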
If a replica set member becomes unresponsive and cannot be brought back up without replacing the entire server, then manual intervention is needed to resolve the issue. As discussed in the 'What's the solution?' summary section, you would need to do the following
The image below shows a scenario with a replica set of 3 members where Server 03 is unresponsive. The disk volume Disk 3 is STALE when compared to the other two members, as the data is not being replicated to Server 03.
MongoDB Server Replacement
Depending on the data size, a full data synchronization from scratch on a brand new server takes a lot of time. So, first let's discuss how to make the data synchronization faster.
You could achieve faster syncs by reusing the data volume Disk 3 from Server 03 and mounting Disk 3 on the replacement server Server 04. Doing so, one could completely avoid the initial sync and only synchronize the oplog data that has not been replicated yet. In other words, if your old/dead server is replaced in 10 minutes, then the replacement server just needs to replicate the oplog entries that happened during those 10 minutes.
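One practical check before relying on this shortcut: the replica set's oplog window must be comfortably larger than the time it takes to replace the server. The snippet below is a minimal pymongo sketch of that check (my own addition, not part of the article's code; the connection target is a placeholder):

```python
# Estimate the replica set's oplog window (time between the first and last oplog entries).
# If the window is much larger than your expected server-replacement time, an oplog-only
# catch-up is feasible; otherwise the new member would need a full initial sync.
from pymongo import MongoClient

# Connect directly to one surviving member; host is a placeholder.
client = MongoClient("mongodb://localhost:27017", directConnection=True)
oplog = client.local.oplog.rs

first = oplog.find().sort("$natural", 1).limit(1).next()
last = oplog.find().sort("$natural", -1).limit(1).next()

window_seconds = last["ts"].time - first["ts"].time
print(f"Oplog window: {window_seconds / 3600:.1f} hours")
```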
A few assumptions made while considering the reuse of existing disk volumes to achieve faster sync times are
The previously mentioned solution works fine, but it still requires manual intervention to do the following
The automation process is the core of this article; it gets thick real quick and is definitely not for the faint-hearted. So, I would like to introduce you to the automation process in order of increasing complexity. Before we get into the details, here is a high-level overview of how you could automate the above steps.
High-level overview: When the replacement server Server 04 comes up, a shell script will mount all the disk volumes tagged to Server 03. AWS CloudWatch will invoke an AWS Lambda function based on the 'running' event. The Lambda function, written in Python, will update the DNS entry in AWS Route 53 so that the FQDN of the Server 03 instance is now mapped onto the IP address of Server 04.
In order to automate the process of mounting the disk volumes onto an EC2 instance, you first need to identify/search for the resources. I am using AWS resource tags to search for the EC2 instances and their associated disk volumes. The image below shows a few tags associated with the disk volume '/dev/xvdb' mounted on the EC2 instance skamon_demoapp_rs03.
AWS Resource Tags
If the EC2 instance associated with the instance_name tag skamon_demoapp_rs03 is deemed unresponsive, then you need to provision a new replacement EC2 instance for it. When the new server boots, an Ansible script fetches the disk volumes using tags and, if they are not already mounted, mounts them onto the current EC2 instance. You could potentially use any tools at your disposal, including the AWS Command Line Interface, Puppet, Chef etc., to achieve the same; a minimal sketch of the idea is shown below.
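For illustration, here is a minimal boto3 sketch of the search-and-attach step. It is not the actual Ansible role from the repository; the region, the instance_name tag value and the hypothetical device_name tag are assumptions you would adapt to your own tagging scheme.

```python
# Find the EBS volumes tagged for the dead member and attach them to the current instance.
# Mounting the filesystem itself would still be an OS-level step after attachment.
from urllib.request import urlopen

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region

# Discover this instance's own ID via the metadata service (assumes IMDSv1 is enabled).
my_instance_id = urlopen(
    "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
).read().decode()

volumes = ec2.describe_volumes(
    Filters=[
        {"Name": "tag:instance_name", "Values": ["skamon_demoapp_rs03"]},
        {"Name": "status", "Values": ["available"]},
    ]
)["Volumes"]

for volume in volumes:
    # 'device_name' is a hypothetical tag carrying the device path, e.g. /dev/xvdb.
    device = next(
        (t["Value"] for t in volume.get("Tags", []) if t["Key"] == "device_name"),
        "/dev/xvdb",
    )
    ec2.attach_volume(VolumeId=volume["VolumeId"], InstanceId=my_instance_id, Device=device)
    print(f"Attached {volume['VolumeId']} as {device}")
```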
None of the members in the replica set can work with Server 04, as the server is not part of the replica set yet. However, reconfiguring a replica set can cause the current primary member to step down, which may trigger an election for a new primary. You could completely avoid reconfiguring the replica set by using the Fully Qualified Domain Name (FQDN) of the servers in your replica set configuration, rather than the default private DNS name, which is unique for every EC2 instance.
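As an illustration of what that looks like in practice, the sketch below initiates a replica set using FQDNs. The host names, the private zone demoapp.internal and the replica set name are placeholders, not values from the article's repository.

```python
# Initialize the replica set with stable FQDNs from a Route 53 private hosted zone,
# instead of EC2 private DNS names that change with every replacement instance.
from pymongo import MongoClient

# Connect directly to the first (not yet initiated) member; placeholder host name.
client = MongoClient("mongodb://skamon_demoapp_rs01.demoapp.internal:27017",
                     directConnection=True)

config = {
    "_id": "demoapp",  # placeholder replica set name
    "members": [
        {"_id": 0, "host": "skamon_demoapp_rs01.demoapp.internal:27017"},
        {"_id": 1, "host": "skamon_demoapp_rs02.demoapp.internal:27017"},
        {"_id": 2, "host": "skamon_demoapp_rs03.demoapp.internal:27017"},
    ],
}
client.admin.command("replSetInitiate", config)
```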
The rest of this section discusses the implementation of
I have created a VPC with Public and Private Subnets as shown in the below diagram.
AWS VPC Components
The detailed steps for creating the VPC, subnets etc. are out of scope for the current article. If you are not aware of the concepts in VPC, or not sure about how to isolate database servers from the public internet, then I highly recommend you read the articles What is Amazon VPC and VPC with Public and Private Subnets. I also found the videos below to be extremely useful.
As mentioned earlier, you could completely avoid reconfiguration of the replica set by using the Fully Qualified Domain Name (FQDN) of the servers in your replica set configuration rather than the default private DNS name. In this section, I shall go into the details of setting up Dynamic DNS in AWS using Amazon Route 53 and a slew of other products.
While the default VPC DNS can provide basic name resolution for your VPC, the Route 53 private hosted zones offer richer functionality by comparison. Route 53 allows customers to modify DNS records via web service calls. Using this API one can automate the creation/removal of records sets and hosted zones. — Jeremy Cowan & Efrain Fuentes
I would like to thank Jeremy Cowan and Efrain Fuentes from the Amazon Web Services team for compiling a very interesting article, 'Building a Dynamic DNS for Route 53 using CloudWatch Events and Lambda'. I highly recommend reading the above article to learn more about Dynamic DNS.
The diagram below depicts the events that are triggered behind the scenes and shows how the various systems work together when a new replacement server takes over the place of a dead server. Please read through the sequence-of-events section to follow along with the diagram.
Sequence of events while spawning a new server
The sequence of events:
1. Server 03 (ip-10.12.230.160) in the replica set is dead
2. The replica set marks Server 03 as 'Unreachable'
3. A replacement server, Server 04 (ip-10.12.230.200), is provisioned
4. A script on the new server mounts Disk 03 onto itself (Server 04), having located Disk 03 using resource tags
5. A CloudWatch 'running' event for Server 04 is triggered
6. The Lambda function looks up the details of Server 04 and updates A, PTR, and CNAME records in Route 53 (replacing Server 03 with Server 04)
7. The FQDN, skamon_demoapp_rs03, is now resolved to Server 04
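To make steps 5 through 7 more concrete, here is a trimmed-down sketch of what such a Lambda function could look like. It is not the full Dynamic DNS function referenced above: the hosted zone ID, zone name and tag key are placeholders, and the PTR and CNAME updates are omitted for brevity.

```python
# Sketch of a Lambda handler for the CloudWatch Event fired when an EC2 instance enters
# the 'running' state: look up the instance's 'instance_name' tag and private IP, then
# UPSERT the matching A record in the Route 53 private hosted zone.
import boto3

ec2 = boto3.client("ec2")
route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z1EXAMPLE"     # placeholder private hosted zone ID
ZONE_NAME = "demoapp.internal."  # placeholder zone name


def lambda_handler(event, context):
    instance_id = event["detail"]["instance-id"]
    reservation = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"][0]
    instance = reservation["Instances"][0]

    private_ip = instance["PrivateIpAddress"]
    name_tag = next(t["Value"] for t in instance["Tags"] if t["Key"] == "instance_name")
    fqdn = f"{name_tag}.{ZONE_NAME}"  # e.g. skamon_demoapp_rs03.demoapp.internal.

    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Comment": f"Point {fqdn} at replacement instance {instance_id}",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": fqdn,
                    "Type": "A",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": private_ip}],
                },
            }],
        },
    )
    return {"fqdn": fqdn, "ip": private_ip}
```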
The final piece of the solution that needs to be automated is spawning a new MongoDB replacement server automatically when a failure is detected. You can use Auto Scaling to detect impaired Amazon EC2 instances or unhealthy applications and replace the instances without your intervention. If you are new to Auto Scaling, then I strongly advise you to view Michael Hanisch's webinar on Automating management of Amazon EC2 instances with Auto Scaling.
To implement the automatic provisioning of replacement hardware
Once the auto scaling group is configured as above, when one of the servers is terminated a new EC2 instance is automatically created. You may configure the health check rules of this group to fit your needs.
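To sketch the idea (this is not the exact setup from the repository; the group name, launch template, subnets and region are placeholders), a single auto scaling group pinned to the number of replica set members could be created with boto3 like this:

```python
# A minimal boto3 sketch of the auto scaling setup: one group that keeps exactly as many
# EC2 instances alive as there are replica set members (three here). Health-check settings
# should be tuned to your needs.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")  # assumed region

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="skamon-demoapp-rs",  # placeholder group name
    LaunchTemplate={"LaunchTemplateName": "skamon-demoapp-mongodb", "Version": "$Latest"},
    MinSize=3,
    MaxSize=3,
    DesiredCapacity=3,
    VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",  # private subnets of the VPC
    HealthCheckType="EC2",
    HealthCheckGracePeriod=300,
    Tags=[{
        "Key": "server_group_name",
        "Value": "skamon_demoapp",
        "PropagateAtLaunch": True,
    }],
)
```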
I want to make sure you understand that AWS Auto Scaling is used here purely to help ensure that you are running the desired number of Amazon EC2 instances, not to scale your MongoDB database, which is achieved through proper use of MongoDB sharding.
If you find the use of AWS Auto Scaling for a MongoDB database to be an interesting use case, then you may also like the other use case scenarios listed in the next section.
By combining the 'AWS Auto Scaling' approach with MongoDB's rolling update procedure, you could upgrade the underlying hardware configuration of the replica set members, perform a MongoDB version upgrade, etc. To achieve this, you must rebuild or Patch an AMI and Update an Auto Scaling Group, and recreate all the EC2 instances of the replica set members in a rolling fashion.
As much as I like the use case of upgrading the hardware, using this approach for a trivial process such as a MongoDB version upgrade is overkill in my opinion. A typical MongoDB version upgrade involves an in-place binary upgrade, which is a lot simpler and less time consuming than spawning an entire EC2 instance with an upgraded AMI. I did come across a client who preferred patching the AMI over the simpler route, to keep the overall operations process consistent. So it's purely a matter of personal preference.
While the current solution may be adequate for some of you, there is still scope for further improvement. A few improvements that I thought through are
The current solution uses only one auto scaling group. If your MongoDB deployment is geographically distributed across multiple data centers, then you may have to create one auto scaling group per data center. This will help ensure that there is at least one server up and running per data center.
If your MongoDB deployment is spread across three data centers, then you are inherently covered by MongoDB's high availability. The picture below shows a scenario where one of the replica set members has failed along with a network partition. Despite spawning a new MongoDB server, the fault tolerance of the replica set would still be 0 because of the network partition.
Multiple Data Centers with Network Partition
When the Ansible script runs on the replacement server, it searches for available disks using tag names associated with server_group_name. If two or more replica set members are down at the very same moment, then the scripts from both replacement servers may try to mount the same disk. I am currently using a random delay before mounting disks as a poor man's mechanism to reduce the chance of two replacement servers trying to mount the same disk volume at the same time.
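For illustration, that guard could look roughly like the sketch below (placeholder values, not the actual script from the repository): sleep for a random interval, then re-check that the volume is still 'available' immediately before attaching it, so the server that loses the race simply skips the volume.

```python
# Poor man's locking for volume attachment: wait a random interval, then confirm the
# volume is still 'available' right before attaching it. If another replacement server
# attached it first, attach_volume fails and this server moves on.
import random
import time

import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region


def attach_if_still_available(volume_id, instance_id, device):
    time.sleep(random.uniform(0, 30))  # random delay to spread out competing servers

    state = ec2.describe_volumes(VolumeIds=[volume_id])["Volumes"][0]["State"]
    if state != "available":
        print(f"{volume_id} already taken (state={state}), skipping")
        return False

    try:
        ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)
        return True
    except ClientError as error:
        # Lost the race: the other replacement server attached the volume first.
        print(f"Could not attach {volume_id}: {error}")
        return False
```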
The Ansible script makes use of the AWS credentials from the replacement server to search and mount the disk volumes. Therefore the AWS credentials file must be copied onto the EC2 instance and should be secured in such a way that only the user running the cron job has access to it.
The logic baked into the Ansible scripts can be moved into the AWS Lambda function to completely eliminate the need for a cron job invoking Ansible scripts.
If you can think of improvements that are not listed here, then feel free to drop me a note in the comments.
I strongly recommend that you review the mongod logs to diagnose and fix the real issue rather than blindly replacing the server in trouble. Keeping enough servers up to maintain fault tolerance is of the utmost importance; however, not fixing the issue that caused this situation in the first place is like kicking the can further down the road.
It is possible that your server went down for reasons that could have been addressed elsewhere. Example: the mongod process got killed by the OOM killer. In this scenario, spawning a new MongoDB server is no different from restarting the mongod process. As an alternative, allocating swap space will avoid issues with memory contention and prevent the OOM killer on Linux systems from killing mongod.
Finally, I want to congratulate you for making it this far through a lengthy article. If you are new to some of the technologies listed here, then I can only imagine how much of an information overload you have been through. You deserve a pat on the back for this awesome feat.
If you are planning to use these concepts in your environment, if you can think of interesting use cases not listed above, or if I happened to stir some inspiration in you to customize it further, please drop a comment about it.