Shyam Arjarapu

Mastering MongoDB - Faster elections during rolling maintenance

Maintenance and upgrades on a MongoDB replica set are typically performed in a rolling fashion: you perform the maintenance on one Secondary at a time, with the Primary member going through maintenance last.
When you step down the Primary, all the eligible Secondaries hold an election for a new Primary. Until a new Primary is elected, the database is not available for writes. So, ‘How would you quickly elect a new Primary while performing rolling maintenance?’
This is one of the many articles in the multi-part series, Mastering MongoDB — One tip a day, created solely for you to master MongoDB by learning ‘One tip a day’. Over a few articles in the series, I would like to share tips that help you answer the question above.
This article discusses rolling maintenance, the implications of not having a Primary, the steps required to elect a new Primary quickly and, finally, the pros and cons of the approach.

Rolling maintenance

MongoDB offers redundancy and high availability of the database via replica sets. A replica set not only helps the database recover quickly from node failures and network partitions, but also gives you the ability to perform maintenance tasks without affecting high availability.
The key to being highly available and yet able to perform maintenance is rolling maintenance, where the maintenance is performed on one Secondary at a time:
  1. Stop the MongoDB process/service on a Secondary
  2. Perform the required maintenance/upgrade on the server
  3. Start the MongoDB process/service on the server
  4. Wait for MongoDB on the server to catch up on the Oplog
  5. Repeat the above on the other Secondaries in the replica set
Given a replica set with 3 MongoDB servers — mon01 (Primary), mon02 (Secondary) and mon03 (Secondary) — the rolling maintenance process typically requires you to
  1. Perform the maintenance on the Secondary server, mon03
  2. Perform the maintenance on the other Secondary server, mon02
  3. stepDown the Primary server, mon01
  4. Wait for a new Primary to be elected, let’s say mon02
  5. Perform the maintenance on the former primary server, mon01
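The sequence above can be strung together as a small script. The sketch below is a dry run that only prints the commands to execute; the hostnames, the mongod service name and the run-maintenance.sh script are illustrative placeholders, not part of any MongoDB tooling.

```shell
# Dry run: print the rolling-maintenance commands for each Secondary.
# Hostnames, the 'mongod' service name and 'run-maintenance.sh' are
# illustrative placeholders you would replace with your own.
rolling_plan() {
  for host in "$@"; do
    echo "ssh $host sudo systemctl stop mongod      # 1. stop MongoDB"
    echo "ssh $host sudo ./run-maintenance.sh       # 2. perform maintenance"
    echo "ssh $host sudo systemctl start mongod     # 3. restart MongoDB"
    echo "ssh $host mongo --port 27000 --eval 'rs.printSlaveReplicationInfo()'  # 4. verify oplog catch-up"
  done
}

# Secondaries first; the Primary (mon01) goes last, after stepping down.
rolling_plan mon03 mon02
```

Running the printed commands one host at a time preserves a voting majority throughout the maintenance window.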
For more detailed information on the rolling upgrades please read, Your Ultimate Guide to Rolling Upgrades by Bryan Reinero.

Implications of not having a Primary

By default, both read and write operations are executed on the Primary. You may use the secondary read preference to read from one of the Secondaries. However, the Primary is the only member of the replica set that receives write operations, so it is crucial to always have a Primary in your replica set.
When the Primary is not available or not reachable by the majority of members, all the eligible Secondaries hold an election for a new Primary. Until a new Primary is elected, all write (and Primary-targeted read) operations originating from your client drivers either wait for a Primary to become available or time out. So it is important to elect a Primary quickly, to keep the number of operations waiting on the Primary low.
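To put a rough number on that, the writes queued up during a failover are approximately your write throughput multiplied by the failover time. A back-of-the-envelope sketch (the throughput figure is illustrative):

```javascript
// Rough estimate of write operations left waiting while no Primary
// is available: throughput (writes/sec) x failover duration (sec).
function stalledWrites(writesPerSecond, failoverSeconds) {
  return writesPerSecond * failoverSeconds;
}

// Default ~10s detection window vs a tuned ~3s failover:
console.log(stalledWrites(10000, 10)); // 100000
console.log(stalledWrites(10000, 3));  // 30000
```

Cutting failover from ~10 seconds to ~3 shrinks the backlog that will hit the new Primary by the same factor.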

How to quickly elect a new Primary

If a Primary is unexpectedly terminated, or the majority of members face network connectivity issues reaching it, the Secondaries can call for an election only after missing heartbeats from the Primary for 10 seconds. So detection alone takes some time.

StepDown the Primary

Stepping down the Primary expedites the failover procedure. It is therefore better to stepDown the Primary to deliberately trigger the election than to shut the Primary down and let the Secondaries discover the unreachable Primary on their own. I bet most of you are using this approach already. So, let’s review some of the other tips you can leverage before you stepDown() the Primary.

Make only one Secondary to be electable

If the replication lag on one of your Secondaries is low, you can proactively make it the only Secondary that can be elected in the next election. Typically, you choose a Secondary that has
  1. Low replication lag
  2. Low network latency
  3. A priority similar to the current Primary’s
  4. Or, the next-highest priority among the members
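These criteria can be expressed as a small helper. The function below is hypothetical, not a MongoDB API: it takes a members array shaped like rs.status() output (with plain numbers standing in for Timestamp values) and returns the healthy, caught-up Secondary with the lowest pingMs.

```javascript
// Hypothetical helper: pick the best next-Primary candidate from
// rs.status()-style member documents. Plain numbers stand in for the
// Timestamp values a real rs.status() returns.
function pickNextPrimary(members, primaryOptimeTs) {
  return members
    .filter(m => m.stateStr === "SECONDARY" && m.health === 1)
    .filter(m => m.optime.ts >= primaryOptimeTs)   // low replication lag
    .sort((a, b) => a.pingMs - b.pingMs)[0];       // low network latency first
}

const members = [
  { name: "mon01:27000", stateStr: "PRIMARY",   health: 1, optime: { ts: 100 } },
  { name: "mon02:27000", stateStr: "SECONDARY", health: 1, optime: { ts: 100 }, pingMs: 0 },
  { name: "mon03:27000", stateStr: "SECONDARY", health: 1, optime: { ts: 100 }, pingMs: 75 },
];
console.log(pickNextPrimary(members, 100).name); // mon02:27000
```

The member priorities from rs.conf() could be folded in as a further tie-breaker.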
Assuming you want to pin the Secondary mon02 as the next Primary, you can make the other Secondary, mon03, ineligible to become Primary for 60 seconds by running rs.freeze(60) on it. This makes the election faster because mon02 is the only electable member when you step down mon01.

Reduce the settings.electionTimeoutMillis

The default time limit for detecting that a replica set’s Primary is unreachable is 10 seconds. By reducing settings.electionTimeoutMillis to, say, 2 seconds, you make the detection, and hence the election, faster.
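In the mongo shell this is a fetch-modify-reconfig of the replica set configuration document. The snippet below shows the same mutation on a plain object standing in for what rs.conf() returns; in the shell you would finish by passing the modified document to rs.reconfig().

```javascript
// A plain object standing in for the document rs.conf() returns.
const conf = { _id: "rs0", version: 1, settings: { electionTimeoutMillis: 10000 } };

// Detect a lost Primary after ~2s instead of the default 10s.
conf.settings.electionTimeoutMillis = 2000;

// In the mongo shell, you would now apply it with: rs.reconfig(conf)
console.log(conf.settings.electionTimeoutMillis); // 2000
```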

Summary of steps for faster election

The steps below summarize how to get a faster election during the maintenance window. Please test them before running them in a production environment.
  1. Identify the server you want to be the next Primary
  2. Execute rs.freeze(60) on all the other Secondaries
  3. Set settings.electionTimeoutMillis=2000 in the replica set configuration
  4. Execute rs.stepDown() on the current Primary
  5. Wait for the new Primary to be elected
  6. Reset settings.electionTimeoutMillis=10000 from the new Primary
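These six steps can be sketched as a script. The sketch is a dry run that only prints the commands to execute; the hostnames, the port 27000 and the choice of mon02 as the pinned next Primary are assumptions carried over from the example replica set.

```shell
# Dry run: print the commands for a faster election during maintenance.
# mon01 (current Primary), mon02 (pinned next Primary), mon03 and port
# 27000 are illustrative.
election_plan() {
  echo "mongo --host mon03 --port 27000 --eval 'rs.freeze(60)'"
  echo "mongo --host mon01 --port 27000 --eval 'var c=rs.conf(); c.settings.electionTimeoutMillis=2000; rs.reconfig(c)'"
  echo "mongo --host mon01 --port 27000 --eval 'rs.stepDown()'"
  echo "mongo --host mon02 --port 27000 --eval 'var c=rs.conf(); c.settings.electionTimeoutMillis=10000; rs.reconfig(c)'"
}
election_plan
```

Note that the final reconfig is run against mon02 only after it has won the election.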

Pros & Cons of the approach

Assuming all the above suggestions worked out well for you, you may be wondering:
“If having lower electionTimeoutMillis helps with quicker elections, then why can’t I keep it at lower number all the time?”
Great question! Your application might be facing reduced traffic during the rolling maintenance window. Most importantly, you are closely monitoring all the servers and manually pinning a Secondary to be the next Primary. So it can be acceptable to run with a lower electionTimeoutMillis value at that very moment.
However, setting electionTimeoutMillis to a low value does not only produce faster failover; it also increases sensitivity to slowness of the Primary and to network slowness or spottiness. This may result in unnecessary elections whenever there are transient network connectivity issues. Conversely, setting electionTimeoutMillis to a larger value makes your replica set more resilient to transient network interruptions, but results in a slower average failover time.
The bottom line is YMMV: you need to test various electionTimeoutMillis values and choose the one that suits you best, or leave it at the default value of 10 seconds.
No matter what you do, “Never set the electionTimeoutMillis to a value less than the round-trip network latency between two of your members.”

Hands-On lab exercises

This lab exercise helps you understand the steps needed to quickly elect a new Primary during a rolling maintenance.

Setup environment

First, you need an environment to play around in. I created 3 RHEL v7.5 instances in AWS; you can just as well run everything on your localhost with /etc/hosts entries for the servers. If you already have a MongoDB v3.6 replica set environment, you may skip this step.
Download and untar the MongoDB v3.6 binaries, then start a mongod bound to all IPs and listening on port 27000.
# Run these commands on all 3 of your servers.
# download v3.6
curl -O https://fastdl.mongodb.org/linux/mongodb-linux-x86_64-rhel70-3.6.5.tgz
BIN_NAME="mongodb-linux-x86_64-rhel70-3.6.5"
BIN_VERSION="v3.6.5"
# create data directory and untar the binaries
tar -xvzf "$BIN_NAME.tgz"
rm "$BIN_NAME.tgz"
mv $BIN_NAME $BIN_VERSION
rm -rf data
mkdir data
# start the mongod on port 27000 with bind_ip_all
$BIN_VERSION/bin/mongod --dbpath data --logpath data/mongod.log --fork --replSet rs0 --port 27000 --bind_ip_all
# about to fork child process, waiting until server is ready for connections.
# forked process: 13442
# child process started successfully, parent exiting
# edit /etc/hosts to have the mon0X entries, of course with your own IPs
tail -3 /etc/hosts
# 35.167.113.204   mon01
# 35.167.113.203   mon02
# 35.167.113.206   mon03
A bash script to download MongoDB v3.6.5 and start mongod on port 27000

Initiate replica set

Initiate a MongoDB replica set using the above hosts on server mon01
$BIN_VERSION/bin/mongo --port 27000 <<EOF
rs.initiate({
 _id: 'rs0',
 members: [
  { _id: 0, host : 'mon01:27000' },
  { _id: 1, host : 'mon02:27000' },
  { _id: 2, host : 'mon03:27000' }
] })
EOF
# MongoDB shell version v3.6.5
# connecting to: mongodb://127.0.0.1:27000/
# MongoDB server version: 3.6.5
# {
#   "ok" : 1,
#   "operationTime" : Timestamp(1529022819, 1),
#   "$clusterTime" : {
#     "clusterTime" : Timestamp(1529022819, 1),
#     "signature" : {
#       "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
#       "keyId" : NumberLong(0)
#     }
#   }
# }
# bye
$BIN_VERSION/bin/mongo --port 27000
# rs0:PRIMARY>
rs.isMaster().me
# mon01:27000
A bash script to initiate a replica set with 3 hosts we created earlier

Display the replica set config and status

Please note the outputs of rs.config() and rs.status() below. They help you confirm the current settings.electionTimeoutMillis (10000) and select a Secondary to be the next Primary based on the values of priority, optime, lastHeartbeat and pingMs.
rs.config()
/*
{
  "_id": "rs0",
  "version": 1,
  "protocolVersion": NumberLong(1),
  "members": [{
      "_id": 0,
      "host": "mon01:27000",
      "arbiterOnly": false,
      "buildIndexes": true,
      "hidden": false,
      "priority": 1,
      "tags": {
      },
      "slaveDelay": NumberLong(0),
      "votes": 1
    },
    {
      "_id": 1,
      "host": "mon02:27000",
      "arbiterOnly": false,
      "buildIndexes": true,
      "hidden": false,
      "priority": 1,
      "tags": {
      },
      "slaveDelay": NumberLong(0),
      "votes": 1
    },
    {
      "_id": 2,
      "host": "mon03:27000",
      "arbiterOnly": false,
      "buildIndexes": true,
      "hidden": false,
      "priority": 1,
      "tags": {
      },
      "slaveDelay": NumberLong(0),
      "votes": 1
    }
  ],
  "settings": {
    "chainingAllowed": true,
    "heartbeatIntervalMillis": 2000,
    "heartbeatTimeoutSecs": 10,
    "electionTimeoutMillis": 10000,
    "catchUpTimeoutMillis": -1,
    "catchUpTakeoverDelayMillis": 30000,
    "getLastErrorModes": {
    },
    "getLastErrorDefaults": {
      "w": 1,
      "wtimeout": 0
    },
    "replicaSetId": ObjectId("5b23096362ac76fdc504e6e1")
  }
}
*/
A JavaScript method to show the replica set configuration settings
rs.status()
/*
{
  "set": "rs0",
  "date": ISODate("2018-06-15T00:57:24.708Z"),
  "myState": 1,
  "term": NumberLong(1),
  "heartbeatIntervalMillis": NumberLong(2000),
  "optimes": {
    "lastCommittedOpTime": {
      "ts": Timestamp(1529024241, 1),
      "t": NumberLong(1)
    },
    "readConcernMajorityOpTime": {
      "ts": Timestamp(1529024241, 1),
      "t": NumberLong(1)
    },
    "appliedOpTime": {
      "ts": Timestamp(1529024241, 1),
      "t": NumberLong(1)
    },
    "durableOpTime": {
      "ts": Timestamp(1529024241, 1),
      "t": NumberLong(1)
    }
  },
  "members": [{
      "_id": 0,
      "name": "mon01:27000",
      "health": 1,
      "state": 1,
      "stateStr": "PRIMARY",
      "uptime": 1837,
      "optime": {
        "ts": Timestamp(1529024241, 1),
        "t": NumberLong(1)
      },
      "optimeDate": ISODate("2018-06-15T00:57:21Z"),
      "electionTime": Timestamp(1529022829, 1),
      "electionDate": ISODate("2018-06-15T00:33:49Z"),
      "configVersion": 1,
      "self": true
    },
    {
      "_id": 1,
      "name": "mon02:27000",
      "health": 1,
      "state": 2,
      "stateStr": "SECONDARY",
      "uptime": 1425,
      "optime": {
        "ts": Timestamp(1529024241, 1),
        "t": NumberLong(1)
      },
      "optimeDurable": {
        "ts": Timestamp(1529024241, 1),
        "t": NumberLong(1)
      },
      "optimeDate": ISODate("2018-06-15T00:57:21Z"),
      "optimeDurableDate": ISODate("2018-06-15T00:57:21Z"),
      "lastHeartbeat": ISODate("2018-06-15T00:57:24.663Z"),
      "lastHeartbeatRecv": ISODate("2018-06-15T00:57:23.502Z"),
      "pingMs": NumberLong(0),
      "syncingTo": "mon01:27000",
      "configVersion": 1
    },
    {
      "_id": 2,
      "name": "mon03:27000",
      "health": 1,
      "state": 2,
      "stateStr": "SECONDARY",
      "uptime": 1425,
      "optime": {
        "ts": Timestamp(1529024241, 1),
        "t": NumberLong(1)
      },
      "optimeDurable": {
        "ts": Timestamp(1529024241, 1),
        "t": NumberLong(1)
      },
      "optimeDate": ISODate("2018-06-15T00:57:21Z"),
      "optimeDurableDate": ISODate("2018-06-15T00:57:21Z"),
      "lastHeartbeat": ISODate("2018-06-15T00:57:24.571Z"),
      "lastHeartbeatRecv": ISODate("2018-06-15T00:57:23.075Z"),
      "pingMs": NumberLong(75),
      "syncingTo": "mon01:27000",
      "configVersion": 1
    }
  ],
  "ok": 1,
  "operationTime": Timestamp(1529024241, 1),
  "$clusterTime": {
    "clusterTime": Timestamp(1529024241, 1),
    "signature": {
      "hash": BinData(0, "AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
      "keyId": NumberLong(0)
    }
  }
}
*/
A JavaScript method to show the replica set status information

Choose the potential next Primary

The rs.status() and db.printSlaveReplicationInfo() outputs show that both Secondaries, mon02 and mon03, are fully caught up on the oplog entries of the Primary, mon01. However, the pingMs shows that mon02 is much closer to mon01 than mon03 is. So you may choose mon02 as the potential next Primary when stepping down the current Primary.
db.printSlaveReplicationInfo()
/*
source: mon02:27000
  syncedTo: Fri Jun 15 2018 01:29:11 GMT+0000 (UTC)
  0 secs (0 hrs) behind the primary
source: mon03:27000
  syncedTo: Fri Jun 15 2018 01:29:11 GMT+0000 (UTC)
  0 secs (0 hrs) behind the primary
*/

rs.status().members.map(x=>x.pingMs)
// [ undefined, NumberLong(0), NumberLong(78) ]
A JavaScript function to show the database printSlaveReplicationInfo command output

Freeze the other Secondaries

Based on the pingMs above, we do not want the server mon03 to be elected Primary. So, run the command below on mon03 to keep it from contending in the next election term.
// Freeze mon03
rs.isMaster().me
// mon03:27000

rs.freeze(60)
/*
{
  "ok" : 1,
  "operationTime" : Timestamp(1529033711, 1),
  "$clusterTime" : {
    "clusterTime" : Timestamp(1529033711, 1),
    "signature" : {
      "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
      "keyId" : NumberLong(0)
    }
  }
}
*/
A JavaScript method invoking rs.freeze to make the replica set member ineligible to become primary

Set electionTimeoutMillis and stepDown the Primary

Reconfigure the electionTimeoutMillis in the replica set settings on the current Primary, mon01. Finally, execute rs.stepDown() to forcibly trigger the election and elect mon02 as the next Primary.
// # rs0:PRIMARY>
rs.isMaster().me
// mon01:27000

var conf = rs.conf()
conf.settings.electionTimeoutMillis=2000
rs.reconfig(conf)
/*
{
  "ok": 1,
  "operationTime": Timestamp(1529025863, 1),
  "$clusterTime": {
    "clusterTime": Timestamp(1529025863, 1),
    "signature": {
      "hash": BinData(0, "AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
      "keyId": NumberLong(0)
    }
  }
}
*/

rs.stepDown()
/*
2018-06-15T03:52:42.042+0000 E QUERY    [thread1] Error: error doing query: failed: network error while attempting to run command 'replSetStepDown' on host '127.0.0.1:27000'  :
DB.prototype.runCommand@src/mongo/shell/db.js:168:1
DB.prototype.adminCommand@src/mongo/shell/db.js:186:16
rs.stepDown@src/mongo/shell/utils.js:1341:12
@(shell):1:1
2018-06-15T03:52:42.043+0000 I NETWORK  [thread1] trying reconnect to 127.0.0.1:27000 (127.0.0.1) failed
2018-06-15T03:52:42.043+0000 I NETWORK  [thread1] reconnect 127.0.0.1:27000 (127.0.0.1) ok
*/

// rs0:SECONDARY>
rs.isMaster().primary
// mon02:27000
A JavaScript code to set the electionTimeoutMillis to 2 seconds and stepDown the primary
You may notice that the new Primary is available within ~2 seconds, compared to the default 10–12 seconds. The mongod.log excerpts below from the individual machines show that mon02’s transition to Primary completes within ~2 seconds.
# Server mon02
tail -100 data/mongod.log | grep REPL
# 2018-06-15T03:52:42.304+0000 I REPL     [rsBackgroundSync] could not find member to sync from
# 2018-06-15T03:52:42.305+0000 I REPL     [replexec-28] Member mon01:27000 is now in state SECONDARY
# 2018-06-15T03:52:43.306+0000 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update to mon01:27000: InvalidSyncSource: Sync source was cleared. Was mon01:27000
# 2018-06-15T03:52:43.360+0000 I REPL     [replexec-26] Starting an election, since we've seen no PRIMARY in the past 2000ms
# 2018-06-15T03:52:43.360+0000 I REPL     [replexec-26] conducting a dry run election to see if we could be elected. current term: 4
# 2018-06-15T03:52:43.360+0000 I REPL     [replexec-24] VoteRequester(term 4 dry run) received a yes vote from mon01:27000; response message: { term: 4, voteGranted: true, reason: "", ok: 1.0, operationTime: Timestamp(1529034758, 1), $clusterTime: { clusterTime: Timestamp(1529034758, 1), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } } }
# 2018-06-15T03:52:43.360+0000 I REPL     [replexec-24] dry election run succeeded, running for election in term 5
# 2018-06-15T03:52:43.363+0000 I REPL     [replexec-28] VoteRequester(term 5) received a yes vote from mon01:27000; response message: { term: 5, voteGranted: true, reason: "", ok: 1.0, operationTime: Timestamp(1529034758, 1), $clusterTime: { clusterTime: Timestamp(1529034758, 1), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } } }
# 2018-06-15T03:52:43.363+0000 I REPL     [replexec-28] election succeeded, assuming primary role in term 5
# 2018-06-15T03:52:43.363+0000 I REPL     [replexec-28] transition to PRIMARY from SECONDARY
# 2018-06-15T03:52:43.363+0000 I REPL     [replexec-28] Entering primary catch-up mode.
# 2018-06-15T03:52:43.584+0000 I REPL     [replexec-23] Caught up to the latest optime known via heartbeats after becoming primary.
# 2018-06-15T03:52:43.584+0000 I REPL     [replexec-23] Exited primary catch-up mode.
# 2018-06-15T03:52:45.301+0000 I REPL     [rsSync] transition to primary complete; database writes are now permitted


# Server mon03
tail -100 data/mongod.log | grep REPL
# 2018-06-15T03:52:29.029+0000 I REPL     [rsBackgroundSync] sync source candidate: mon01:27000
# 2018-06-15T03:52:31.631+0000 I REPL     [conn27] 'freezing' for 120 seconds
# 2018-06-15T03:52:42.794+0000 I REPL     [replication-1] Choosing new sync source because our current sync source, mon01:27000, has an OpTime ({ ts: Timestamp(1529034758, 1), t: 4 }) which is not ahead of ours ({ ts: Timestamp(1529034758, 1), t: 4 }), it does not have a sync source, and its not the primary (sync source does not know the primary)
# 2018-06-15T03:52:42.794+0000 I REPL     [replication-1] Canceling oplog query due to OplogQueryMetadata. We have to choose a new sync source. Current source: mon01:27000, OpTime { ts: Timestamp(1529034758, 1), t: 4 }, its sync source index:-1
# 2018-06-15T03:52:42.794+0000 W REPL     [rsBackgroundSync] Fetcher stopped querying remote oplog with error: InvalidSyncSource: sync source mon01:27000 (config version: 4; last applied optime: { ts: Timestamp(1529034758, 1), t: 4 }; sync source index: -1; primary index: -1) is no longer valid
# 2018-06-15T03:52:42.794+0000 I REPL     [rsBackgroundSync] could not find member to sync from
# 2018-06-15T03:52:43.420+0000 I REPL     [ReplicationExecutor] Not starting an election, since we are not electable due to: Not standing for election because I am still waiting for stepdown period to end at 2018-06-09T13:09:29.408+0000 (mask 0x20)
# 2018-06-15T03:52:43.481+0000 I REPL     [ReplicationExecutor] Member mon01:27000 is now in state SECONDARY
# 2018-06-15T03:52:43.584+0000 I REPL     [ReplicationExecutor] Not starting an election, since we are not electable due to: Not standing for election because I am still waiting for stepdown period to end at 2018-06-09T13:09:29.408+0000 (mask 0x20)
# 2018-06-15T03:52:43.584+0000 I REPL     [ReplicationExecutor] Member mon02:27000 is now in state PRIMARY
# 2018-06-15T03:52:43.663+0000 I REPL     [replexec-31] Member mon02:27000 is now in state PRIMARY
# 2018-06-15T03:52:43.817+0000 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update to mon01:27000: InvalidSyncSource: Sync source was cleared. Was mon01:27000
# 2018-06-15T03:52:45.795+0000 I REPL     [rsBackgroundSync] sync source candidate: mon02:27000
A bash script to show the transition of mon02 from Secondary to Primary

Reset the electionTimeoutMillis on the new primary

Once the new Primary is elected, revert electionTimeoutMillis to its default value to avoid frequent elections during transient network connectivity issues.
rs.isMaster().me
// mon02:27000
// rs0:PRIMARY>
// on the new primary
var conf = rs.conf()
conf.settings.electionTimeoutMillis=10000
rs.reconfig(conf)
/*
{
  "ok": 1,
  "operationTime": Timestamp(1529034252, 1),
  "$clusterTime": {
    "clusterTime": Timestamp(1529034252, 1),
    "signature": {
      "hash": BinData(0, "AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
      "keyId": NumberLong(0)
    }
  }
}
*/
A JavaScript code to reset the electionTimeoutMillis back to its default value

Summary

One important point bears repeating:
Although the database remains highly available for reads from the Secondaries during an election, it is not available for writes until a Primary is elected. So it is important to have a Primary available sooner rather than later to meet your SLA for writes.
With the tips discussed here, you can have a new Primary elected within about 3 seconds. If your application serves about 10,000 operations/second, that still leaves about 30,000 operations waiting on the new Primary. Now you may wonder — “What measures can I take to ensure the database server is not crippled when all those 30,000 operations hit the new Primary at the same time?”
Again, a great question, but that’s a topic for another day. Hopefully you learned something new today on your path to “Mastering MongoDB — One tip a day”.

Previous Articles

Mastering MongoDB — One tip a day series
Series of articles solely created for you to master MongoDB
Tip # 003: Transactions
A long awaited and most requested feature for many, has finally arrived
Tip # 002: createRole
How to prevent someone dropping your collections?
Tip # 001: currentOp
Know the operations currently executing on MongoDB server inside out
