This is from my time at DoorDash. The homepage serves millions of users and is one of our highest-traffic surfaces. One of the first things people see when they open the app is item carousels: horizontal rows of individual menu items from nearby stores, grouped by category. Think "Most ordered items near you," "Best pizzas near you," and so on. A hard requirement was to avoid showing stale or unavailable items, at near-real-time latency and at high throughput and scale.
Starting With Live Availability Checks
We started with the simplest approach: just call the source of truth for item availability, which in this case was the menu service. At request time, we'd ask, "Hey, is this item available right now?"
The problem was that the menu service was not built for homepage-scale read volume. It was designed to handle individual store-level lookups: a user taps into a restaurant, and we fetch that restaurant's menu. What we were attempting on the homepage was a different fan-out pattern, where every single app open needed availability checks for hundreds of items across dozens of stores. Even if we could somehow scale the menu service to handle that, a single homepage request would fan out into dozens of downstream calls, one per store. The total budget for rendering an item carousel was under 300ms, and we still needed time for ranking and personalization on top of that.
We did cache menu responses. But you still need to make the downstream calls on a cache miss, and availability data changes frequently enough (stores going offline, items selling out, hours changing) that your TTLs can't be very aggressive. Short TTLs mean more misses, more misses mean more fan-out, and you're back to square one during peak traffic. Caching smooths things out, but it doesn't change the fundamental shape of the problem.
So we went a different route. We listened to real-time menu update events and kept our Elasticsearch index in sync as availability changed. For the cases where we still needed to call the menu service directly, we added caching with rate limiting as a safety net so we wouldn't accidentally DDoS our own service.
Why Index Availability
Two major reasons. First, better recall: you can fetch the top 50 available items directly, rather than fetching 50 items, post-filtering out the unavailable ones, and ending up short. Second, we could use availability as one of several signals for pre-ranking within Elasticsearch, so items from stores that are currently open and active get boosted before they even leave the index. That means less data to pull out and re-rank in the application layer.
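As a sketch of that second point, availability can sit in a `should` clause so open stores score higher without being a hard filter. This is illustrative only; the field names (`store_open`, `category`) are hypothetical, not our actual schema:

```json
{
  "query": {
    "bool": {
      "must": { "match": { "category": "pizza" } },
      "should": [
        { "term": { "store_open": { "value": true, "boost": 2.0 } } }
      ]
    }
  }
}
```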
Now, item availability was decided at multiple levels: is the store available? Is the menu on which the item lives available? Is the particular item itself available? All of that needed to be captured in the index schema. I'll walk through how we iterated on that.
Attempt 1: Nested Documents
We started with the most obvious representation:
{
  "availability_hours": [
    { "day": 1, "open": "10:00", "close": "14:00" },
    { "day": 1, "open": "17:00", "close": "22:00" },
    { "day": 2, "open": "10:00", "close": "14:00" }
  ]
}
This is how you'd model it in a database. Here day = 1 is Monday, day = 2 is Tuesday, and so on. We indexed these as nested documents and ran nested queries to check if the current time fell within any window.
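The per-request check looked roughly like this. It's a sketch rather than our exact production query, and it assumes `availability_hours` is mapped as type `nested` with `open`/`close` stored as zero-padded HH:MM keyword strings (so lexicographic range comparison works), with the request time encoding to day 1 at 11:30:

```json
{
  "query": {
    "nested": {
      "path": "availability_hours",
      "query": {
        "bool": {
          "must": [
            { "term":  { "availability_hours.day": 1 } },
            { "range": { "availability_hours.open":  { "lte": "11:30" } } },
            { "range": { "availability_hours.close": { "gte": "11:30" } } }
          ]
        }
      }
    }
  }
}
```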
P99: 600ms. We needed under 300ms.
What we found is that nested documents in ES are stored as separate documents under the hood, and at query time, ES has to join them back to the parent. That join gets expensive fast when you have millions of items with multiple availability windows each. Gojek ran into the same wall with restaurant hours on GO-FOOD, and their P99 also blew up during peak load.
Attempt 2: Encoded Terms
Gojek's fix was to flatten availability into encoded integers: day × 10000 + hour × 100 + minute, with minutes in 5-minute increments. Monday 10:00am becomes 11000, Monday 10:05am is 11005, and Tuesday 2:30pm is 21430.
Then you expand each availability window into every 5-minute slot it covers and store the whole list as terms. At query time, encode the current timestamp the same way and do a term match: "does this integer exist in the document?"
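The encoding and expansion steps above can be sketched in a few lines of Python (function names are mine, not from either team's code):

```python
def encode(day: int, hour: int, minute: int) -> int:
    """Encode a time point as day*10000 + hour*100 + minute.

    day: 1 = Monday .. 7 = Sunday; minute should be a multiple of 5.
    """
    return day * 10000 + hour * 100 + minute


def expand_window(day: int, open_hhmm: str, close_hhmm: str) -> list[int]:
    """List every 5-minute slot from open (inclusive) to close (exclusive)."""
    oh, om = map(int, open_hhmm.split(":"))
    ch, cm = map(int, close_hhmm.split(":"))
    start, end = oh * 60 + om, ch * 60 + cm
    return [encode(day, m // 60, m % 60) for m in range(start, end, 5)]


# Monday 10:00 encodes to 11000, Tuesday 14:30 to 21430:
assert encode(1, 10, 0) == 11000
assert encode(2, 14, 30) == 21430

# A single Monday 10:00-14:00 window already produces 48 terms:
slots = expand_window(1, "10:00", "14:00")
assert len(slots) == 48 and slots[0] == 11000 and slots[-1] == 11355
```

The expansion is where the storage cost comes from: every window becomes dozens or hundreds of terms instead of one record.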
P99 dropped to about 350ms. Better, but we had a new problem: storage. An item available 12 hours a day, 7 days a week, generates over a thousand encoded time points per week (12 hours × 12 slots per hour × 7 days = 1,008). Multiply that across millions of items and our index ballooned to about 6x its original size. And 350ms was still above our target.
So we challenged ourselves to find an approach that avoided both the storage blowup and the remaining latency.
Attempt 3: Ranges with BKD Trees
So what if instead of enumerating every 5-minute slot, we just stored the start and end of each window? Same encoding as before, Monday 10:00am is 11000, Monday 2:00pm is 11400, but we store [11000, 11400] as a range instead of listing out every value in between.
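Converting a window into a range pair uses the same encoding, but the per-window output collapses to two integers (again, helper names here are mine):

```python
def encode(day: int, hour: int, minute: int) -> int:
    """day*10000 + hour*100 + minute, with day 1 = Monday."""
    return day * 10000 + hour * 100 + minute


def window_to_range(day: int, open_hhmm: str, close_hhmm: str) -> dict:
    """Collapse an availability window into a single {gte, lte} range."""
    oh, om = map(int, open_hhmm.split(":"))
    ch, cm = map(int, close_hhmm.split(":"))
    return {"gte": encode(day, oh, om), "lte": encode(day, ch, cm)}


# Monday 10:00-14:00 collapses from 48 encoded terms to two integers:
assert window_to_range(1, "10:00", "14:00") == {"gte": 11000, "lte": 11400}
```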
ES has range field types (integer_range, date_range) that shipped in version 5.2. They're backed by BKD trees internally: a tree structure that's good at numeric range lookups. A range gets stored as a 2D point (min, max) in the tree, and at query time, ES can skip over huge portions of the index by checking bounding boxes at each level instead of scanning everything.
The mapping:
{
  "availability_ranges": {
    "type": "integer_range"
  }
}
Indexed document:
{
  "availability_ranges": [
    { "gte": 11000, "lte": 11400 },
    { "gte": 11700, "lte": 12200 },
    { "gte": 21000, "lte": 21400 }
  ]
}
Query:
{
  "query": {
    "range": {
      "availability_ranges": {
        "relation": "contains",
        "gte": 11130,
        "lte": 11130
      }
    }
  }
}
You can also store multiple ranges in a single field on a single document, which meant we didn't need to go back to nested docs.
P99: 250ms. Under our 300ms target. Storage went back to roughly baseline since we're storing two integers per window instead of hundreds.
What We Learned
We tested all three approaches against real production data in a sandbox before committing to any of them. These weren't synthetic benchmarks; we replayed actual homepage traffic patterns against each index configuration. I'd recommend doing that for any search infrastructure change. Synthetic load tests would not have caught the storage blowup with the terms approach, for example, because the index size depends on actual availability patterns in the data.
If we'd stopped after the Gojek-style encoding, we probably would have shipped 350ms and moved on to other things. The storage bloat is what pushed us to keep looking, and that's how we stumbled into range fields. We went in trying to fix latency and it was actually a storage constraint that led us to the best answer.
The other thing is that the choice of ES field type matters way more than most people realize. Nested docs, terms on an inverted index, range queries on a BKD tree. Same data, same cluster, 600ms vs 250ms.