How I Reindex Millions of Elasticsearch Documents Using AWS Lambda

Or how to overcome the 5-minute Lambda execution limit

So last December I was rewriting our indexing microservice in a serverless way. Along the way, I had to migrate Elasticsearch from version 2.3 to 6.1.

Not that I was eager to rush the upgrade, quite the opposite. But one day Elastic Cloud announced that version 2.3 was approaching end-of-life.

As you can imagine, jumping over several major versions introduces breaking changes, so there was good reason to rewrite the microservice from scratch, this time in a serverless way.

Problem, Context, Solution

Shifting from queue-based (SQS) to event-driven (SNS) indexing was quite an interesting undertaking, but I’ll leave that for another post.

First of all, when do you reindex data from the database to Elasticsearch?

Changing the index mapping. You add a new field or change the type of an existing one.
ES cluster outage. When this happens, no new data is written to the search index.

In both cases, you need to fetch loads of documents from the database and flush everything to Elasticsearch as fast as possible.

Let’s look at the old way of reindexing when you have a stateful long-running microservice.

The old way: long-running microservice

The problem is the “loop”, which can take hours to iterate over all documents. Lambda runs for 5 minutes at most, and it doesn’t keep state between invocations.

SQS support for Lambda would be nice, but as of today it’s still on the roadmap. So what if we temporarily keep the state in the DB itself? Not truly the serverless way, but everything is a tradeoff.

By the way, I used Mermaid to generate the sequence diagram above.

Recursive Lambdas

I ended up creating a collection to keep track of reindexing progress. It is more like a list of jobs, each of which includes:

Query selector: the starting point for opening a MongoDB cursor
Progress: the number of successful and failed operations
ID of the last reindexed document: this one is important
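
To make the job record more concrete, here is a minimal sketch of what one entry might look like. Python is used only for illustration, and the class and field names (ReindexJob, last_document_id and so on) are hypothetical, not taken from the actual service.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class ReindexJob:
    """One entry in a (hypothetical) collection of reindexing jobs."""

    job_id: str
    selector: Dict[str, Any]                # query selector used to open the MongoDB cursor
    succeeded: int = 0                      # progress: successful operations so far
    failed: int = 0                         # progress: failed operations so far
    last_document_id: Optional[Any] = None  # where the next execution resumes

    def record_batch(self, succeeded: int, failed: int, last_document_id: Any) -> None:
        """Accumulate the counters and move the resume point forward."""
        self.succeeded += succeeded
        self.failed += failed
        self.last_document_id = last_document_id


job = ReindexJob(job_id="job-1", selector={"type": "article"})
job.record_batch(succeeded=9998, failed=2, last_document_id="doc-10000")
job_as_document = asdict(job)  # plain dict, ready to be stored in the jobs collection
```

Keeping the counters and the resume point in one record means any later execution can pick the job up knowing only its ID.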

The idea is to kick off one Lambda to reindex the first batch of documents, let’s say 10,000.

It creates a job with an ID, writes the batch to ES and calls itself recursively.

The next iteration looks up the query selector by job ID and appends the ID of the last reindexed document, so it can pick up where the previous execution left off.
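
The steps above could be sketched roughly as follows. This is a simplified illustration under assumptions, not the actual implementation: fetch_batch, index_batch, save_job and invoke_next are hypothetical callables standing in for the MongoDB query, the ES write, the job update and the self-invocation. In a real Lambda, invoke_next would typically wrap an asynchronous invoke of the same function (for example the boto3 Lambda client's invoke call with InvocationType="Event").

```python
from typing import Any, Callable, Dict, List, Optional

Doc = Dict[str, Any]
Job = Dict[str, Any]


def run_batch(
    job: Job,
    fetch_batch: Callable[[Dict[str, Any], Optional[Any], int], List[Doc]],
    index_batch: Callable[[List[Doc]], int],
    save_job: Callable[[Job], None],
    invoke_next: Callable[[Dict[str, Any]], None],
    batch_size: int = 10_000,
) -> Job:
    """Reindex one batch, persist progress, and schedule the next run if needed."""
    # Resume right after the last reindexed document (None on the first run).
    docs = fetch_batch(job["selector"], job.get("last_document_id"), batch_size)

    if docs:
        failed = index_batch(docs)  # hypothetical: returns the number of failed operations
        job["succeeded"] = job.get("succeeded", 0) + len(docs) - failed
        job["failed"] = job.get("failed", 0) + failed
        job["last_document_id"] = docs[-1]["_id"]

    # A short (or empty) batch means there is nothing left to fetch.
    job["done"] = len(docs) < batch_size
    save_job(job)  # the state lives in the DB, not in the Lambda

    if not job["done"]:
        # Hand over to the next execution; only the job ID needs to travel.
        invoke_next({"job_id": job["job_id"]})
    return job
```

Each execution handles exactly one batch, so it stays well within the time limit as long as the batch size is chosen conservatively.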

It sounds complex, so let’s revisit the flow.

The new way: recursive Lambda function

Some tricks which make this possible:

Sort your query by _id: it’s an indexed field and cheap to sort on. Date fields might be more suitable, depending on the case.
Sorted queries allow predictable iteration: every subsequent execution queries by selector + {_id: {$gt: lastDocumentId}}, so it continues right after the last reindexed document. More on pagination in the article We’re doing pagination wrong.
Disable ES index refresh during heavy reindexing: but don’t forget to re-enable it afterwards (it defaults to 1s).

PUT /index_name/_settings
{ "index" : { "refresh_interval" : "-1" } }
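
Coming back to the sorted-query trick, here is a small sketch of how the query for the next batch might be built. The function name build_page_query is hypothetical. It uses $gt so that the document stored as the last reindexed one is not fetched again, and wraps both parts in $and so that a selector which already filters on _id keeps its own condition.

```python
from typing import Any, Dict, Optional


def build_page_query(selector: Dict[str, Any], last_document_id: Optional[Any]) -> Dict[str, Any]:
    """Combine the job's selector with a resume condition on _id."""
    if last_document_id is None:
        # First batch: there is no resume point yet, so use the selector as is.
        return dict(selector)
    # Later batches: same selector, but only documents after the last reindexed one.
    return {"$and": [selector, {"_id": {"$gt": last_document_id}}]}


first_page = build_page_query({"type": "article"}, None)
next_page = build_page_query({"type": "article"}, "doc-10000")
# With pymongo, the query would then be run sorted by _id, for example:
#   collection.find(next_page).sort("_id", 1).limit(10000)
```

Because the resume point is part of the query itself, no cursor has to survive between Lambda executions.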

Wins

So it may look like a lot of hassle, but this was the only non-obvious part of the old indexing microservice to migrate.

It also gave me a chance to rethink the implementation so that millions of documents are reindexed in a matter of 15-ish minutes.

Since you get a fresh Lambda container every so often, there is little chance of running into a memory leak, which was an issue before.

Not to forget AWS X-Ray, which plays nicely with Lambda. It helped uncover many performance bottlenecks in calls to Mongo / S3 / ES.

And in the end, you gain all the usual perks of serverless. Enjoy!