So last December I was rewriting our indexing microservice in a serverless way. Along the way, I had to migrate Elasticsearch from version 2.3 to 6.1.
Not that I was eager to rush the upgrade, quite the opposite. But one day Elastic Cloud announced that version 2.3 was approaching end-of-life.
As you can imagine, jumping over several major versions introduces breaking changes, so there was a good reason to rewrite the microservice from scratch, this time in a serverless way.
It was quite an interesting undertaking to shift from queue-based (SQS) to event-driven (SNS) indexing, but I’ll leave that for another post.
First of all, when do you reindex data from the database to Elasticsearch?
In both cases, you need to fetch loads of documents from the database and flush everything to Elasticsearch as fast as possible.
Let’s look at the old way of reindexing when you have a stateful long-running microservice.
The old way: long-running microservice
The problem is the “loop”, which can take hours to iterate over all documents. But Lambda runs for only 5 minutes, and you don’t have any state.
A nice thing would be having SQS support for Lambda. But as of today, it’s still on the roadmap. So what if we temporarily keep the state in the DB itself? Not truly a serverless way, but everything is a tradeoff.
By the way, I used Mermaid to generate the sequence diagram above.
I ended up creating a collection to keep track of reindexing progress. More like a list of jobs, which include:
The idea is to kick off one Lambda to reindex the first batch of documents, let’s say 10 000.
It creates a job with an id, puts the data into ES and calls itself recursively.
The next iteration looks up the query selector by job id and appends the ID of the last reindexed document, so it can pick up where the previous execution stopped.
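To make the recursion a bit more concrete, here is a minimal Python sketch of a single iteration. It is not the original service code: the names `handle`, `fetch_batch`, `index_batch`, `invoke_next` and the job fields `selector`, `lastDocumentId` and `done` are all made up for illustration, and the database, Elasticsearch and the Lambda self-invocation are replaced by in-memory stand-ins so the flow can be run locally.

```python
from typing import Any, Callable, Dict, List, Optional

Doc = Dict[str, Any]
BATCH_SIZE = 10_000  # batch size used as the example in the text


def handle(
    event: Dict[str, Any],
    jobs: Dict[str, Dict[str, Any]],  # stand-in for the jobs collection
    fetch_batch: Callable[[Dict[str, Any], Optional[Any], int], List[Doc]],
    index_batch: Callable[[List[Doc]], None],  # stand-in for the write to ES
    invoke_next: Callable[[Dict[str, Any]], None],  # stand-in for the self-invocation
    batch_size: int = BATCH_SIZE,
) -> Dict[str, Any]:
    """Run one iteration: reindex a batch, save progress, schedule the next run."""
    job_id = event["jobId"]
    # The first call creates the job; later calls find it by id.
    job = jobs.setdefault(
        job_id,
        {"selector": event.get("selector", {}), "lastDocumentId": None, "done": False},
    )
    docs = fetch_batch(job["selector"], job["lastDocumentId"], batch_size)
    if docs:
        index_batch(docs)
        # Remember where this run stopped so the next one can resume.
        job["lastDocumentId"] = docs[-1]["_id"]
    if len(docs) < batch_size:
        job["done"] = True  # a short batch means nothing is left
    else:
        invoke_next({"jobId": job_id})
    return job


def simulate(total_docs: int, batch_size: int) -> Dict[str, Any]:
    """Drive the handler locally against in-memory stand-ins."""
    database = [{"_id": i} for i in range(1, total_docs + 1)]
    indexed: List[Doc] = []
    jobs: Dict[str, Dict[str, Any]] = {}
    pending: List[Dict[str, Any]] = [{"jobId": "job-1", "selector": {}}]
    invocations = 0

    def fetch_batch(selector, last_id, limit):
        # Strictly-greater comparison, so the last document is not read twice.
        rows = [d for d in database if last_id is None or d["_id"] > last_id]
        return rows[:limit]

    while pending:  # each loop turn plays the role of one Lambda invocation
        event = pending.pop(0)
        invocations += 1
        handle(event, jobs, fetch_batch, indexed.extend, pending.append, batch_size)

    return {"indexed": len(indexed), "invocations": invocations, "job": jobs["job-1"]}
```

Calling `simulate(25, 10)` walks through three invocations and leaves the job marked as done. In a real deployment `invoke_next` could wrap an asynchronous call such as boto3’s `invoke` with `InvocationType="Event"`, but that wiring is an assumption and not something taken from the original service.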
It sounds complex, so let’s revisit the flow.
The new way: recursive Lambda function
Some tricks which make this possible:
_id: it’s an indexed field and cheap to sort. Date fields might be more suitable depending on the case.
selector + {_id: {$gte: lastDocumentId}}. More on pagination in the article We’re doing pagination wrong.
PUT /index_name/_settings { "index" : { "refresh_interval" : "-1" } }
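As a rough illustration of the tricks above, the Python sketch below builds the paginated selector and the settings body. The helper names `page_selector` and `refresh_settings` are hypothetical, and restoring the interval to `1s` afterwards is my assumption (it is the Elasticsearch default), not a step described here.

```python
import json
from typing import Any, Dict, Optional


def page_selector(selector: Dict[str, Any], last_document_id: Optional[Any]) -> Dict[str, Any]:
    """Extend the job's selector so the query resumes from the last reindexed _id."""
    if last_document_id is None:
        return dict(selector)  # first batch: nothing to skip yet
    # $gte as in the text: the last document is fetched (and indexed) once more.
    return {**selector, "_id": {"$gte": last_document_id}}


def refresh_settings(interval: str) -> str:
    """JSON body for PUT /index_name/_settings."""
    return json.dumps({"index": {"refresh_interval": interval}})


DISABLE_REFRESH = refresh_settings("-1")  # before the bulk load, as in the text
RESTORE_REFRESH = refresh_settings("1s")  # assumption: back to the ES default afterwards
```

With pymongo, for instance, the selector could be passed to something like `collection.find(page_selector(selector, last_id)).sort("_id", 1).limit(10000)`. Switching `$gte` to `$gt` would skip the document that was already indexed, at the cost of deviating from the selector shown above.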
So it may look like a lot of hassle, but this was the only non-obvious part of the old indexing microservice to migrate.
This gave me a chance to rethink the implementation so that millions of documents are reindexed in a matter of 15-ish minutes.
Since you get a fresh Lambda container every so often, there is little chance of running into a memory leak, which was an issue before.
Not to forget AWS X-Ray, which plays nicely with Lambda. It helped uncover many performance bottlenecks in calls to Mongo / S3 / ES.
And in the end, you gain all the usual perks of serverless, enjoy!