Losing a server without prior warnings is no longer a question of if, but when it happens — how to ensure MongoDB high availability is not at stake?
This article provides a solution to automatically spin a replacement server when one of the server your replica set faces an unexpected server failure in order. This approach will help reduce the overall time your replica set is exposed to fault tolerance value of 0.
Who should read it?
The intended readers of this article are Operations, DevOps teams with basic knowledge of AWS products, MongoDB enthusiasts and all the technologist who want to know anything and everything. The technologies used in this article are MongoDB, AWS EC2, Lambda, Route 53, DynamoDB,
CloudWatch, VPC, Auto Scaling, Ansible, IAM and Python
Why should you read it?
In an unexpected server failure situation, the fault tolerance of replica set may drop to 0. In other words, the replica set is prone to having “no primary” should one more server fail. This article helps you
- Build knowledge base on the above situation
- Steps required to restore the fault tolerance value > 0
- Automate the process in AWS
What will you learn?
The article outlines the following topics in detail
- Worst case scenarios when fault tolerance of replica sets is 0
- Typical process followed by operations team to resolve issue
- Ways to speed up the process replacement via automation
- Pros and Cons of the re-spawn approach
- Scope for enhancements to fit your needs
What’s the solution?
Considering that this article gets deep into details, I would like to give an executive brief on ‘How it’s done?’. Unlike the stateless web server, MongoDB database server is all about state. So, I have used
- CloudWatch, Lambda to detect a dead/deteriorating EC2 instance
- Auto Scaling to re-spawn and replace a dead server
- Ansible & Resource Tags to associate disk volumes to EC2 instances
- Route53, Lambda to resolve/update the DNS/server names
- Let the new replica set member sync up automatically
Hopefully that’s short and sweat summary!
The solution discussed in this article will not help you attain higher fault tolerance. It primarily helps reduce the duration of your replica set exposed with fault tolerance value of 0. I want you to be aware that neither this article nor the solution discussed in here is officially backed/supported by MongoDB Inc. So, feel free use it at your own risk!
The source code is available at repo: https://github.com/sarjarapu/whitepapers/tree/master/mongodb/respawn/code.
Being a consulting engineer on the field, I work with different client-side teams consisting of operations, DBAs, architects and developers. Many of these clients hosts their MongoDB database servers on cloud providers such as Amazon Web Services, Google Cloud Platform etc. Some of the larger clients with the big footprint in Amazon AWS and/or who has been using AWS for a few years have come across issues like below
- Corrupt EC2 instances
- Could not SSH to the server
- Amazon retired your instance etc.
One of the very interesting and challenging questions surfaced while working with such clients was
“Losing an EC2 instance without prior warnings is no longer a question of if, but rather when it does happen — what actions can we take to ensure MongoDB is highly available?”
Like any responsible operations team, these teams are highly concerned about the worst-case scenarios and wanted to be proactive by having contingency plans to act upon. With one of the replica set member facing unexpected server failure, the fault tolerance of the replica set could be at 0. In such scenarios your replica set is in a vulnerable state as it cannot sustain an additional system failures to be highly available.
To restore the fault tolerance back to normal, the process typically requires
- Diagnosis of the server issue
- Analyze if replacement of the server is warranted
- If replacement of server is required
- Re-configure the MongoDB replica set
- Data synchronization on the new server
- Update application connection string to use new server
I would like to highlight that this approach requires manual intervention from an operations team and consumes a lot of time. Apart from re-spawning a new server tasks, the operations have to be highly vigilant of overall system health; ensuring no additional failures happen while the fault tolerance of your replica set is at 0. This article offers you a solution, which may help you sleep better through the night in spite of server failures. So, let’s get started!
If a majority of the replica set members are inaccessible or unavailable then you cannot to elect a primary member and the replica set would not be able to accept writes. So it is crucial to have a primary member or have enough members to elect a primary. Most importantly, reduce the time you are exposed to vulnerable state of losing the primary with a loss of an additional member.
If you are scared to death of losing yet another member and not have primary member then you should have higher fault tolerance than what you currently have. Referring to the number of members v/s fault tolerance matrix, to attain peace of mind with fault tolerance of 2, you would at least need a replica set of 5 members. However, if you cannot afford a 5 member replica set, then this article may help take steps to reduce the time period during which the fault tolerance of your replica set is 0.
The solution — Manual version
If a replica set member becomes unresponsive and cannot be brought back up without replacing the entire server then you would need manual intervention to resolve the issue. As discussed in the What’s the solution? summary section, you would need to do the following
- Replace the unresponsive server with the a new replacement server
- Reconfigure the replica set with new replacement server
- Wait for data synchronization to complete
- Update the application connection string with new server
The below image shows a scenario with a replica set of 3 members where
Server 03 is unresponsive. The disk volume -
Disk 3 is
STALE when compared to the other two members as the data is not being replicated to
Depending on the DataSize, a full data synchronization from the scratch on a brand new server takes a lot of time. So, first lets discuss on how to make the data synchronization faster
Faster data synchronization
You could achieve faster syncs by reusing the data volumes
- Un-mount the MongoDB data volume —
- Mount the data volume —
Disk 3on the replacement server
Doing so one could completely avoid the initial sync and just do the synchronization of oplog data that was not replicated yet. In other words, if your old/dead server is replaced in 10 mins, then replacement server just needs to replicate the oplog entries happened during those 10 mins.
A few assumptions made while considering reuse of existing disk volumes to achieve faster sync time are
- Use of separate data volume: The MongoDB data files are written onto separate disk volume than underlying Operating System disks so that you could un-mount/re-mount elsewhere.
- Re-mountable disks: EBS volumes can be attached to replacement EC2 instance. However, the EC2 Instance Store / Ephemeral disks that are physically attached and cannot be mounted on to another instance.
- Disk or data is not corrupted: If the disk volume or MongoDB data files are corrupted, then re-mounting the same disk volume on the replacement server would still be in the same corrupted state. To fix corrupted disk/data you may have to full initial sync on a new disk volume
The solution — Automated version
Previously mentioned solution works fine, but still requires manual intervention to do the following
- Creating a replacement servers
- Reconfiguring the replica set with newer member
- Mounting of re-used disk volume & data synchronization
The automation process is the core of this article and gets thick real quick and definitely not for the faint hearts. So, I would like to introduce you the automation process in the order of increasing complexity. Before we get into the details, here is high-level overview on how you could automate the above steps.
High-level overview: When the replacement server
Server 04 comes up, a shell script will mount all the disk volumes tagged to
Server 03. AWS CloudWatch will invoke a AWS Lambda function based on 'running' event. The Lambda function, written in Python, will update the DNS entry on the AWS Route 53 so that the FQDN of the
Server 03 instance is now mapped onto the IP address of
Automate: Mounting of data volumes
In order to automate the process of mounting the disk volumes onto an EC2 instance, first you need to identify/search for the resources. I am using the AWS resource tags to search for the EC2 instances and their associated disk volumes. Below image shows a few tags associated with disk volume — ‘/dev/xvdb’ mounted on an EC2 instance
If EC2 instance associated with
skamon_demoapp_rs03 is deemed unresponsive then you need to provision a new replacement EC2 instance for it. When the new server boots, if it has not already mounted, an Ansible script will fetching the disk volume using tags and mounts them onto the current EC2 instance. You could potentially use any tools at your disposal including AWS Command Line Interface, Puppet, Chef etc for achieving the same.
Avoid replica set reconfiguration
None of the members in the replica set can work with
Server 04 as the server is not part of the replica set yet. However, reconfiguring a replica set can cause the current primary member to step down, which may trigger an election for a new primary. You could completely avoid the reconfiguration of the replica set by using the Fully Qualified Domain Name (FQDN) of the servers in your replica set configuration than using default private DNS name, which is unique for every EC2 instance.
Rest of this section discusses about the implementation of
- VPC Setup & Components
- Dynamic DNS
VPC Setup & Components
I have created a VPC with Public and Private Subnets as shown in the below diagram.
The detailed steps for creating VPC, subnets etc. is out of scope for the current article. If you are not aware of the concepts in VPC or not sure about how to isolate database servers from the public internet then I highly recommend you to read the articles What is Amazon VPC and VPC with Public and Private Subnets. I also found the below videos to be extremely useful.
As mentioned earlier, you could completely avoid reconfiguration of the replica set by using the Fully Qualified Domain Name (FQDN) of the servers in your replica set configuration than using default private DNS name. In this section, I shall go into the details of setting up a Dynamic DNS in AWS using Amazon Route53 and a slew of other products.
While the default VPC DNS can provide basic name resolution for your VPC, the Route 53 private hosted zones offer richer functionality by comparison. Route 53 allows customers to modify DNS records via web service calls. Using this API one can automate the creation/removal of records sets and hosted zones. — Jeremy Cowan & Efrain Fuentes
I would like to thank Jeremy Cowan and Efrain Fuentes from Amazon Web Services team for compiling a very interesting article — ‘Building a Dynamic DNS for Route 53 using CloudWatch Events and Lambda’. I highly recommend you to read the above article to know more about Dynamic DNS.
Below diagram depicts various events that will trigger behind the scenes and shows how various systems work together when a new replacement server is taking over the place of a dead server. Please read through the sequence of events section to follow along with the diagram.
The sequence of events:
Server 03(ip-10.12.230.160) in the replica set is dead
- The other members of replica set see’s the
Server 03as 'Unreachable'
- Fault tolerance of replica set is now dropped to 0, with only 2 out of 3 member are up and running.
- A new replacement server
Server 04(ip-10.12.230.200) is provisioned
- The Ansible script on new server copies various tags from
Disk 03onto itself (
- The script updates hostname of server, finds & mounts the
Disk 03using resource tags
- Cloud Watch event for ‘EC2 instance state: running’ on
Server 04is triggered
- AWS Lambda function subscribed to above Cloud Watch event runs the Python code
- The Python code fetches CNAME / ZONE resource instance tags on
Server 04and Updates A, PTR, and CNAME records in Route53 (replacing
- The DNS name,
skamon_demoapp_rs03, is now resolved to
- The replication process starts and keeps all the members synchronized and the fault tolerance of the replica set will be back to 1
Automate: Provisioning the replacement server
The final piece of the solution that needs to be automated is ‘spawning a new MongoDB replacement server’ automatically when a failure is detected. You can use Auto Scaling to detect an impaired Amazon EC2 instances or unhealthy applications and replace the instances without your intervention. If you are new to Auto Scaling, then I strongly advise you to view Michael Hanisch webinar on Automating management of Amazon EC2 instances with Auto Scaling
To implement the automatic provisioning of replacement hardware
- Create and configure a new auto scaling group
- With the EC2 instances of your MongoDB replica set in the group
- Set the group to always maintain a min/max count = replica set member count
- Use AMI with all required tools and softwares in the launch configuration
Once the auto scaling group is configured as above, when one of the servers is terminated then a new EC2 instance is automatically created. You may configure the health check rules of this group to fit your needs.
I want to make sure you understand that the feature ‘AWS Auto Scaling’ is used purely to help ensure that you are running the desired number of Amazon EC2 instances rather than the scalability of your MongoDB database which is achieved through proper use of MongoDB Sharding.
If you find the use of ‘AWS Auto Scaling’ feature for a MongoDB database is an interesting use case, then you may also like Other use case scenarios listed in the next section.
Use cases & Scope for improvement
Other use case scenarios
Combining the ‘AWS Auto Scaling’ approach along with the ‘MongoDB rolling update’ feature you could upgrade the underlying hardware configuration of replica set members, perform the MongoDB version upgrade, etc. To achieve this, you must rebuild or Patch an AMI and Update an Auto Scaling Group and recreate all EC2 instances of the replica set members in a rolling upgrade fashion.
As much as I like the use case of upgrading the hardware, using this approach for a trivial process such as MongoDB version upgrade is an overkill in my opinion. A typical MongoDB version upgrade involves an in-place binary version upgrade, which is a lot simpler and less time consuming than spawning an entire EC2 instance with an upgraded AMI. I did come across a client who prefered to patch AMI over the simplicity to keep the overall operations process consistent. So it’s purely your personal preference.
Scope for improvement
While the current solution is adequate to some of you, there is still scope for further improvement. A few improvisations that I thought through are
Cross data center support
The current solution uses only one auto scaling group. If your MongoDB deployment is geographically distributed across multiple data centers all the time, then you may have to create one auto scaling group per each data center. This will help ensure that there is at least one server up and running per data center.
If your MongoDB deployment is spread across three data centers then inherently you are covered with MongoDB’s High Availability. Below picture shows a scenario with one of the replica set member is failed along with a network partition scenario. Despite spawning a new MongoDB server the fault tolerance of the replica set would still be 0 because of the network partition.
Multiple failures at same time
When the Ansible script runs on the replacement server, it searches for available disks using tag names associated with
server_group_name. If two or more replica set members are down at the very same moment, then the scripts from both the replacement servers may try to mount the same disk. I am currently using a random delay before mounting disks as a poorman's mechanism to overcome the possibility of two replacement servers trying to mount the same disk volume at same time.
The Ansible script makes use of the AWS credentials from the replacement server to search and mount the disk volumes. Therefore the AWS credentials file must be copied onto the EC2 instance and should be secured in such a way that only the user running the cron job has access to it.
Eliminate cron jobs
The logic baked into the Ansible scripts can be moved into the AWS Lambda function to completely eliminate the need for a cron job invoking Ansible scripts.
If you could think of improvisations that are not listed in here then feel free to drop me note in the comments.
I strongly recommend you to review the mongod.logs to diagnose and fix the real issue rather than blindly replace the server in trouble. Having the number of servers up to maintain the fault tolerance is utmost importance, however not fixing the issue that caused this situation in the first place is like kicking the can further down the road.
It could be possible that your server went down for reasons that could have been addressed elsewhere. Example: The mongod got killed by OOM Killer. In this scenario, spawning a new MongoDB server is no different than restarting the mongod process. As an alternative, allocating swap space will avoid issues with memory contention and prevent the OOM Killer on Linux systems from killing mongod.
Finally, I want to congratulate you for making it this far of the lengthy article. If you are new to some of the technologies listed in here, then I could only imagine how much of an information overload you been through. You all deserve a pat on your shoulders for this awesome feat.
If you are planning to use these concepts in your environment or you think of any interesting use cases not listed above or I happened to stir some inspiration in you to customize it further, please drop a comment about it.