At Plaid, we make heavy use of Amazon-hosted ElasticSearch for real-time log analysis — everything from finding the root cause of production errors to analyzing the lifecycle of API requests.
The ElasticSearch cluster is one of the most widely used systems internally. If it is unavailable, many teams can’t do their work effectively. As such, ElasticSearch availability is one of the top SLAs that our team — the Data Science and Infrastructure (DSI) team — is responsible for.
So, you can imagine the urgency and seriousness when we experienced repeated ElasticSearch outages over a two-week span in March of 2019. During that time, the cluster would go down multiple times a week as a result of data nodes dying, and all we could see from our monitoring was JVM Memory Pressure spikes on the crashing data nodes.
This blog post is the story of how we investigated this issue and ultimately addressed the root cause. We hope that by sharing this, we can help other engineers who might be experiencing similar issues and save them a few weeks of stress.
During the outages, we would see something that looked like this:
Graph of ElasticSearch node count during one of these outages on 03/04 at 16:43.
In essence, over the span of 10–15 minutes, a significant percentage of our data nodes would crash and the cluster would go into a red state.
The cluster health graphs in the AWS console indicated that these crashes were immediately preceded by JVMMemoryPressure spikes on the data nodes.
JVMMemoryPressure spiked around 16:40, immediately before nodes started crashing at 16:43.
Breaking down by percentiles, we could infer that only the nodes that crashed experienced high JVMMemoryPressure (the threshold seemed to be 75%). Note this graph is unfortunately in UTC time instead of local SF time, but it’s referring to the same outage on 03/04.
In fact, zooming out across multiple outages in March (the spikes in this graph), you can see every incident we encountered then had a corresponding spike in JVMMemoryPressure.
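For reference, the same metric can also be pulled programmatically from CloudWatch instead of eyeballing the console graphs. Here's a rough boto3 sketch; the domain name and account ID below are placeholders, not our actual values:

```python
# Rough sketch: pull per-cluster JVMMemoryPressure from CloudWatch with boto3.
# "logs" and "123456789012" are placeholder values for the domain and account.
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/ES",
    MetricName="JVMMemoryPressure",
    Dimensions=[
        {"Name": "DomainName", "Value": "logs"},        # hypothetical domain name
        {"Name": "ClientId", "Value": "123456789012"},  # your AWS account ID
    ],
    StartTime=datetime.utcnow() - timedelta(hours=3),
    EndTime=datetime.utcnow(),
    Period=60,
    Statistics=["Maximum"],
)

# Print the maximum memory pressure per minute over the last three hours.
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])
```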
After these data nodes crashed, the AWS ElasticSearch auto recovery mechanism would kick in to create and initialize new data nodes in the cluster. Initializing all these data nodes could take up to an hour. During this time, ElasticSearch was completely unqueryable.
After data nodes were initialized, ElasticSearch began the process of copying shards to these nodes, then slowly churned through the ingestion backlog that was built up. This process could take several more hours, during which the cluster was able to serve queries, albeit with incomplete and outdated logs due to the backlog.
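If you'd rather script this recovery progression than refresh the console, the cluster health API exposes it directly. A rough sketch, with a placeholder endpoint (and assuming IP-based access to the domain, so request signing is omitted):

```python
# Sketch of watching recovery progress after an outage.
import time
import requests

ES_ENDPOINT = "https://our-es-domain.us-east-1.es.amazonaws.com"  # hypothetical

while True:
    health = requests.get(f"{ES_ENDPOINT}/_cluster/health").json()
    print(
        health["status"],  # red -> yellow -> green as shards come back
        "initializing:", health["initializing_shards"],
        "unassigned:", health["unassigned_shards"],
    )
    if health["status"] == "green":
        break
    time.sleep(60)
```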
We considered several possible scenarios that could lead to this issue:
At this point, we suspected (correctly) that the node failures were caused by resource-intensive search queries running on the cluster, which made the nodes run out of memory. However, two key questions remained:
As we continued experiencing ElasticSearch outages, we tried a few things to answer these questions, to no avail:
As it turned out, finding the root cause required accessing logs that weren’t available in the AWS console.
After filing multiple AWS support tickets and getting templated responses from the AWS support team, we (1) started looking into other hosted log analysis solutions outside of AWS, (2) escalated the issue to our AWS technical account manager, and (3) let them know that we were exploring other solutions. To their credit, our account manager was able to connect us to an AWS ElasticSearch operations engineer with the technical expertise to help us investigate the issue at hand (thanks Srinivas!).
Several phone calls and long email conversations later, we identified the root cause: user-written queries that were aggregating over a large number of buckets. When these queries were sent to the ElasticSearch cluster, it tried to keep an individual counter for every unique key it saw. With millions of unique keys, even if each counter took up only a small amount of memory, they quickly added up.
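To give a concrete (and purely illustrative, not taken from our cluster) sense of the shape of such a query, a terms aggregation over a high-cardinality field is all it takes. The endpoint, index pattern, and field name here are hypothetical:

```python
# Illustrative only: a terms aggregation over a high-cardinality field
# forces the cluster to hold one bucket counter per unique value.
import requests

ES_ENDPOINT = "https://our-es-domain.us-east-1.es.amazonaws.com"  # hypothetical

query = {
    "size": 0,
    "aggs": {
        "requests_by_id": {
            "terms": {
                "field": "request_id.keyword",  # hypothetical field with millions of unique values
                "size": 10000000,               # asks ES to materialize millions of buckets
            }
        }
    },
}

# Assumes IP-based access to the domain; request signing omitted for brevity.
resp = requests.post(f"{ES_ENDPOINT}/logs-*/_search", json=query)
print(resp.json())
```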
Srinivas on the AWS team came to this conclusion by looking at logs that are only internally available to the AWS support staff. Even though we had enabled error logs, search slow logs, and index slow logs on our ElasticSearch domain, we still did not (and do not) have access to the warning logs that were printed shortly before the nodes crashed. If we had, we would have seen:
[2019-03-19T19:48:11,414][WARN][o.e.d.s.a.MultiBucketConsumerService] This aggregation creates too many buckets (10001) and will throw an error in future versions. You should update the [search.max_buckets] cluster setting or use the [composite] aggregation to paginate all buckets in multiple requests.
The query that generated this log was able to bring down the cluster because we had no limit on the number of buckets an aggregation query was allowed to create, and the circuit breakers were not configured to reject a runaway request before it exhausted the heap.
To address the two problems above, we needed to cap the number of buckets a single query could create (search.max_buckets) and tighten the circuit breaker settings so that memory-hungry requests get rejected instead of taking down the node.
Unfortunately, AWS ElasticSearch does not allow clients to change these settings directly by making PUT requests to the _cluster/settings ElasticSearch endpoint, so you have to file a support ticket in order to update them.
Once the settings are updated, you can double check by curling _cluster/settings. Side note: if you look at _cluster/settings, you'll see both persistent and transient settings. Since AWS ElasticSearch does not allow cluster-level reboots, these two are effectively equivalent.
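Here's a rough sketch of that verification step. The endpoint and the specific values are illustrative; on a self-managed cluster you could apply the settings yourself with a PUT, which is exactly what AWS ElasticSearch blocks:

```python
# Sketch of verifying the cluster settings AWS applied on our behalf.
# Endpoint and expected values are illustrative; access signing omitted.
import requests

ES_ENDPOINT = "https://our-es-domain.us-east-1.es.amazonaws.com"  # hypothetical

settings = requests.get(f"{ES_ENDPOINT}/_cluster/settings?flat_settings=true").json()
print(settings["persistent"])
# Expected to include entries along the lines of:
#   "search.max_buckets": "10000"
#   "indices.breaker.request.limit": "40%"

# On a self-managed cluster you could update these directly; AWS ElasticSearch
# rejects this PUT, which is why the change goes through a support ticket:
# requests.put(f"{ES_ENDPOINT}/_cluster/settings",
#              json={"persistent": {"search.max_buckets": 10000}})
```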
Once we configured the circuit breaker and max buckets limitations, the same queries that used to bring down the cluster would simply error out instead of crashing the cluster.
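For queries that legitimately need to walk a huge number of buckets, the warning log above points at the safer pattern: a composite aggregation that pages through buckets across multiple requests instead of materializing them all at once. A sketch, with hypothetical endpoint, index, and field names:

```python
# Sketch of paginating buckets with a composite aggregation rather than
# building every bucket in a single request.
import requests

ES_ENDPOINT = "https://our-es-domain.us-east-1.es.amazonaws.com"  # hypothetical

after_key = None
while True:
    composite = {
        "size": 1000,  # only 1,000 buckets held in memory per request
        "sources": [{"request_id": {"terms": {"field": "request_id.keyword"}}}],
    }
    if after_key:
        composite["after"] = after_key  # resume where the previous page ended
    resp = requests.post(
        f"{ES_ENDPOINT}/logs-*/_search",
        json={"size": 0, "aggs": {"by_request": {"composite": composite}}},
    ).json()
    agg = resp["aggregations"]["by_request"]
    for bucket in agg["buckets"]:
        ...  # process one page of buckets at a time
    after_key = agg.get("after_key")
    if not after_key:
        break
```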
From reading about the above investigation and fixes, you can see how much the lack of log observability limited our ability to get to the bottom of the outages. Dear AWS ElasticSearch engineers, if you are reading this, we'd love more control of and visibility into our clusters so we can be better equipped to maintain them.
Until then, for the developers out there considering using AWS ElasticSearch, know that by choosing this instead of hosting ElasticSearch yourself, you are giving up access to raw logs and the ability to tune some settings yourself. This will significantly limit your ability to troubleshoot issues, but it also comes with the benefits of not needing to worry about the underlying hardware, and being able to take advantage of AWS’s built-in recovery mechanisms.
If you are already on AWS ElasticSearch, turn on all the logs immediately — namely, error logs, search slow logs, and index slow logs. Even though these logs are still incomplete (for example, AWS only publishes 5 types of debug logs), it’s still better than nothing. Just a few weeks ago, we tracked down a mapping explosion that caused the master node CPU to spike using the error log and CloudWatch Log Insights.
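One caveat: enabling slow log publishing in the AWS console isn't enough on its own; you also need to set slow log thresholds on your indices via the settings API before anything shows up. A sketch, with an illustrative endpoint, index pattern, and thresholds:

```python
# Sketch of setting slow log thresholds after enabling CloudWatch publishing.
# Index pattern and threshold values are illustrative, not recommendations.
import requests

ES_ENDPOINT = "https://our-es-domain.us-east-1.es.amazonaws.com"  # hypothetical

requests.put(
    f"{ES_ENDPOINT}/logs-*/_settings",
    json={
        "index.search.slowlog.threshold.query.warn": "5s",
        "index.search.slowlog.threshold.fetch.warn": "1s",
        "index.indexing.slowlog.threshold.index.warn": "10s",
    },
)
```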
Thank you to Michael Lai, Austin Gibbons, Jeeyoung Kim, and Adam McBride for proactively jumping in and driving this investigation. Giving credit where credit is due, this blog post is really just a summary of the amazing work that they’ve done.
Want to work with these amazing engineers? We are hiring!
If you like this post, follow me on Twitter for more posts on engineering, processes, and backend systems.