Data modeling in Elasticsearch is not as obvious as it is when dealing with relational databases. Unlike traditional relational databases that rely on data normalization and SQL joins, Elasticsearch requires alternative approaches for managing relationships.
There are four common workarounds for managing relationships in Elasticsearch:
Application-side joins
Data denormalization
Nested field types and nested queries
Parent-child relationships
In this blog, we’ll discuss how you can design your data model to handle relationships using the nested field type and parent-child relationships. We’ll cover the architecture, performance implications, and use cases for these two techniques.
Elasticsearch supports nested structures, where objects can contain other objects. Nested field types are JSON objects within the main document, which can have their own distinct fields and types. These nested objects are treated as separate, hidden documents that can only be accessed using a nested query.
Nested field types are well-suited for relationships where data integrity, close coupling, and hierarchical structure are important. These include one-to-one and one-to-many relationships where there is one main entity. For example, a person and their multiple addresses and phone numbers can be represented within a single document.
With nested field types, Elasticsearch stores the entire document, parent and nested objects alike, in a single Lucene block within the same segment. This can result in faster query speeds because the relationship is contained within a single document.
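To see why it matters that nested objects are treated as separate hidden documents, the following Python sketch imitates the matching behavior in memory. It assumes a hypothetical person document with an `addresses` array, and the helper names `matches_flattened` and `matches_nested` are made up for illustration. This is a toy model of the semantics, not how Lucene is implemented: with a plain object array, values are pooled per field and a query can match across different objects, while a nested field requires all conditions to hold within one object.

```python
# Toy model of object-vs-nested matching; not Elasticsearch code.
person = {
    "name": "Ada",
    "addresses": [
        {"city": "Austin", "state": "TX"},
        {"city": "Portland", "state": "OR"},
    ],
}


def matches_flattened(doc, field, conditions):
    """Model a plain object array: values are pooled per field name,
    so the link between the values of one object is lost."""
    pooled = {}
    for obj in doc[field]:
        for key, value in obj.items():
            pooled.setdefault(key, set()).add(value)
    return all(value in pooled.get(key, set()) for key, value in conditions.items())


def matches_nested(doc, field, conditions):
    """Model a nested field: each object is checked on its own."""
    return any(
        all(obj.get(key) == value for key, value in conditions.items())
        for obj in doc[field]
    )


# No single address is in Austin, OR, yet the flattened model reports a match.
query = {"city": "Austin", "state": "OR"}
flattened_hit = matches_flattened(person, "addresses", query)
nested_hit = matches_nested(person, "addresses", query)
print(flattened_hit, nested_hit)
```

The flattened model reports a match for an address that does not exist, while the nested model does not, which is the data integrity property the nested field type is meant to preserve.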
Let’s look at an example of a blog post with comments. We want to nest the comments below the blog post so they can be easily queried together in the same document.
{
  "post_id": "1",
  "title": "Introduction to Elasticsearch Data Modeling",
  "content": "Exploring various data modeling options in Elasticsearch.",
  "comments": [
    {
      "comment_id": "101",
      "text": "Great overview of data modeling!"
    },
    {
      "comment_id": "102",
      "text": "Looking forward to more content."
    }
  ]
}
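The document above only behaves as described if `comments` is mapped as a nested field and searched with a nested query. Here is a minimal sketch of both request bodies, built as plain Python dictionaries so they can be printed or handed to whichever client you use. The index name `blog_posts`, the helper name `comments_matching`, and the search term are assumptions for illustration.

```python
import json

INDEX_NAME = "blog_posts"  # hypothetical index name

# Body for: PUT /blog_posts
nested_mapping = {
    "mappings": {
        "properties": {
            "post_id": {"type": "keyword"},
            "title": {"type": "text"},
            "content": {"type": "text"},
            "comments": {
                "type": "nested",  # each comment is indexed as a hidden document
                "properties": {
                    "comment_id": {"type": "keyword"},
                    "text": {"type": "text"},
                },
            },
        }
    }
}


def comments_matching(term):
    """Body for: GET /blog_posts/_search
    Finds posts that have at least one comment whose text matches `term`."""
    return {
        "query": {
            "nested": {
                "path": "comments",
                "query": {"match": {"comments.text": term}},
                # inner_hits returns the matching comments, not just the post
                "inner_hits": {},
            }
        }
    }


print(json.dumps(comments_matching("overview"), indent=2))
```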
The drawbacks of nested object relationships include:
Update inefficiency: Updates, inserts, and deletes on any part of a document with nested objects require reindexing the entire document, which can be memory-intensive, especially if the documents are large or updates are frequent.
Query performance with large nested fields: If your documents have particularly large nested fields, queries can slow down, because the search request retrieves the entire document.
Multiple levels of nesting can become complex: Running queries across nested structures with multiple levels can still become complex. That’s because queries may involve nested queries within nested queries, leading to less readable code.
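The last point is easier to judge with a concrete query. The sketch below assumes a hypothetical second nesting level, `comments.replies`, mapped as `nested` inside `comments`, and uses a small illustrative helper, `wrap_nested`, to build the query body. Each level of nesting needs its own `nested` clause with the full path, which is where the verbosity comes from.

```python
import json


def wrap_nested(paths, leaf_query):
    """Wrap `leaf_query` in one nested clause per path, outermost first.

    `paths` lists full paths from the document root, for example
    ["comments", "comments.replies"].
    """
    query = leaf_query
    for path in reversed(paths):
        query = {"nested": {"path": path, "query": query}}
    return query


# Find posts where some reply to some comment mentions "thanks".
# `comments.replies.text` is an assumed field, not part of the example above.
body = {
    "query": wrap_nested(
        ["comments", "comments.replies"],
        {"match": {"comments.replies.text": "thanks"}},
    )
}

print(json.dumps(body, indent=2))
```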
In parent-child mapping, documents are organized into parent and child types. Each child document has a direct association with a parent document. This relationship is established through a specific field value in the child document that matches the parent's ID. The parent-child model adopts a decentralized approach where parent and child documents exist independently.
Parent-child joins are suitable for one-to-many or many-to-many relationships between entities. Imagine an application where you want to create relationships between companies and contacts and want to search for companies and contacts as well as contacts at specific companies.
Elasticsearch makes parent-child joins performant by keeping track of what parents are connected to which children and having both entities reside on the same shard. By localizing the join operation, Elasticsearch avoids the need for extensive inter-shard communication, which can be a performance bottleneck.
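The co-location described above comes from routing: a child is indexed with its parent's ID as the routing value, and the routing value decides the shard. The sketch below models that as `hash(routing) % number_of_shards`. It is a simplification: `zlib.crc32` stands in for the hash Elasticsearch actually uses, and the shard count, IDs, and the `shard_for` helper are made up for the example.

```python
import zlib

NUM_PRIMARY_SHARDS = 5  # assumed shard count for the example


def shard_for(routing_value, num_shards=NUM_PRIMARY_SHARDS):
    """Pick a shard from a routing value (stand-in hash, see the note above)."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards


parent_id = "1"
child_ids = ["101", "102"]

parent_shard = shard_for(parent_id)

# Routed by parent ID: every child lands on the parent's shard.
routed_child_shards = [shard_for(parent_id) for _ in child_ids]

# Routed by their own IDs (the default): placement is unrelated to the parent.
default_child_shards = [shard_for(child_id) for child_id in child_ids]

print("parent shard:", parent_shard)
print("children routed by parent ID:", routed_child_shards)
print("children routed by own ID:", default_child_shards)
```

This is why child documents are indexed with a `routing` parameter set to the parent's ID.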
Let’s take the example of a parent-child relationship for blog posts and comments. Each blog post, i.e., the parent, can have several comments, i.e., the children. To create the parent-child relationship, let’s index the data as follows:
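PUT my-index-000001
{
  "mappings": {
    "properties": {
      "post_id": {
        "type": "keyword"
      },
      "post_comment_join": {
        "type": "join",
        "relations": {
          "post": "comment"
        }
      }
    }
  }
}
A parent document would be a post that looks like this, with the join field (named post_comment_join here for illustration) stating its side of the relation:
{
  "post_id": "1",
  "title": "Introduction to Elasticsearch Data Modeling",
  "content": "Exploring various data modeling options in Elasticsearch.",
  "post_comment_join": "post"
}
The child document would then be a comment whose join field names its relation and the ID of its parent. It must be indexed with the parent's ID as the routing value so that it lands on the same shard as the parent.
{
  "comment_id": "101",
  "text": "Great overview of data modeling!",
  "post_id": "1",
  "post_comment_join": {
    "name": "comment",
    "parent": "1"
  }
}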
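Once posts and comments are indexed this way, the relationship is used at query time through `has_child` and `has_parent` queries. A minimal sketch of both request bodies as Python dictionaries follows. The relation names `post` and `comment` come from the mapping above, while the helper names and search terms are assumptions for illustration.

```python
import json


def posts_with_comment_matching(term):
    """Search body: parent posts that have at least one matching comment."""
    return {
        "query": {
            "has_child": {
                "type": "comment",
                "query": {"match": {"text": term}},
                # also return the comments that matched
                "inner_hits": {},
            }
        }
    }


def comments_on_posts_matching(term):
    """Search body: child comments whose parent post title matches."""
    return {
        "query": {
            "has_parent": {
                "parent_type": "post",
                "query": {"match": {"title": term}},
            }
        }
    }


print(json.dumps(posts_with_comment_matching("overview"), indent=2))
print(json.dumps(comments_on_posts_matching("Elasticsearch"), indent=2))
```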
The benefits of parent-child modeling include:
Resembles relational data model: In parent-child relationships, the parent and child documents are separate and are linked by a unique parent ID. This setup is closer to a relational database model and can be more intuitive for those familiar with such concepts.
Update efficiency: Child documents can be added, modified, or deleted without affecting the parent document or other child documents. This is particularly beneficial when dealing with a large number of child documents that require frequent updates. Note that associating a child document with a different parent is a more complex process as the new parent may be on another shard.
Better suited for heterogeneous children: Since child documents are stored separately, they may be more memory and storage-efficient, especially in cases where there are many child documents with significant size differences.
The drawbacks of parent-child relationships include:
Expensive, slow queries: Joining separate parent and child documents at query time adds computational work and slows queries down. Elasticsearch notes that parent-child queries can be 5-10x slower than querying nested objects.
Mapping overhead: Parent-child relationships can consume more memory and cache resources. Elasticsearch maintains a map of parent-child relationships, which can grow large and consume significant memory, especially with a high volume of documents.
Shard size management: Since both parent and child documents reside on the same shard, there's a potential risk of uneven data distribution across the cluster. Some shards might become significantly larger than others, especially if there are parent documents with many children. This can lead to challenges in managing and scaling the Elasticsearch cluster.
Reindexing and cluster maintenance: If you need to reindex data or change the sharding strategy, the parent-child relationship can complicate this process. You'll need to ensure that the relationship integrity is maintained during such operations. Routine cluster maintenance tasks, such as shard rebalancing or node upgrades, may become more complex. Special care must be taken to ensure that parent-child relationships are not disrupted during these processes.
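The shard size concern can be illustrated with a small simulation. It assumes three shards, a made-up set of posts with very different comment counts, and `zlib.crc32` as a stand-in for Elasticsearch's routing hash. The point is only that every child follows its parent, so one busy parent inflates one shard.

```python
import zlib
from collections import Counter

NUM_SHARDS = 3  # assumed for the example

# Hypothetical parents: post ID -> number of comments (children).
comment_counts = {"post-1": 5, "post-2": 20000, "post-3": 12, "post-4": 8}


def shard_for(routing_value, num_shards=NUM_SHARDS):
    """Stand-in for routing: hash the routing value, take it modulo shards."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards


docs_per_shard = Counter()
for post_id, n_comments in comment_counts.items():
    shard = shard_for(post_id)
    # The parent plus all of its children are stored on the same shard.
    docs_per_shard[shard] += 1 + n_comments

busiest_shard, busiest_count = docs_per_shard.most_common(1)[0]
print(dict(docs_per_shard))
print("largest shard holds", busiest_count, "documents")
```

Whichever shard receives `post-2` ends up holding almost all of the documents, regardless of how the other parents are spread out.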
Elastic, the company behind Elasticsearch, generally recommends trying application-side joins, data denormalization, or nested objects before going down the path of parent-child relationships.
The table below provides a recap of the characteristics of nested field types and queries and parent-child relationships to compare the data modeling approaches side by side.
| | Nested field types and nested queries | Parent-child relationships |
|---|---|---|
| Definition | Nests an object within another object | Links parent and child documents together |
| Relationships | One-to-one, one-to-many | One-to-many, many-to-many |
| Query speed | Generally faster than parent-child relationships as the data is stored in the same block and segment | Generally 5-10x slower than nested objects as parent and child documents are joined at query time |
| Query flexibility | Less flexible than parent-child queries as it limits the scope of the querying to within the bounds of each nested object | Offers more flexibility in querying as parent or child documents can be queried together or separately |
| Data updates | Updating nested objects requires the reindexing of the entire document | Updating child documents is easier as it does not require all documents to be reindexed |
| Management | Simpler management since everything is contained within a single document | More complex to manage due to separate indexing and maintaining of relationships between parent and child documents |
| Use cases | Store and query complex data with multiple levels of hierarchy | Relationships where there are few parents and many children, like products and product reviews |
While Elasticsearch provides several workarounds to SQL-style joins, including nested queries and parent-child relationships, the drawbacks covered above make both models harder to scale as data volumes and update rates grow. When designing for applications at scale, it may make sense to consider an alternative with native SQL join capabilities, such as Rockset.
Rockset is a search and analytics database that's designed for SQL search, aggregations, and joins on any data, including deeply nested JSON data. As data is streamed into Rockset, it is encoded in the database’s core data structures used to store and index the data for fast retrieval. Rockset indexes the data in a way that allows for fast queries, including joins, using its SQL-based query optimizer. As a result, there is no upfront data modeling required to support SQL joins.
One of the challenges with Elasticsearch is how to preserve relationships efficiently when data is updated. One reason is that Elasticsearch is built on Apache Lucene, which stores data in immutable segments, so changing any field means reindexing the entire document. Rockset uses RocksDB, a key-value store open-sourced by Meta and built for data mutations, to efficiently support field-level updates without needing to reindex entire documents.
Let’s compare the parent-child relationship approach in Elasticsearch with a SQL query in Rockset.
In the parent-child relationship example above, we modeled posts with multiple comments by creating two document types:
posts, the parent document type
comments, the child document type
We used a unique identifier, the parent ID, to establish the relationship between the parent and child documents. At query time, we use the Elasticsearch DSL to retrieve comments for a specific post.
In Rockset, the data containing posts would be stored in one collection, a table in the relational world, while the data containing comments would be stored in a separate collection. At query time, we would join the data together using a SQL query.
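Here are the two approaches side by side. The Elasticsearch requests below use the older type-based syntax with a parent parameter, which predates the join field shown earlier: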
POST /blog/posts/1
{
  "title": "Elasticsearch Modeling",
  "content": "A post about data modeling in Elasticsearch"
}
POST /blog/comments/2?parent=1
{
  "text": "Great post!"
}
POST /blog/comments/3?parent=1
{
  "text": "I learned a lot from this."
}
To retrieve a post by its title together with all of its comments, the comments have to be requested as inner hits of a has_child query:
GET /blog/posts/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "Elasticsearch Modeling" } },
        {
          "has_child": {
            "type": "comments",
            "query": { "match_all": {} },
            "inner_hits": {
              "name": "comments",
              "_source": ["text"]
            }
          }
        }
      ]
    }
  }
}
In Rockset, with posts and comments stored in separate collections, retrieving a post and its comments takes a single SQL join:
SELECT p.title, p.content, c.text
FROM posts p
JOIN comments c ON p.post_id = c.post_id
WHERE p.post_id = 1;
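The join itself is standard SQL, so its behavior can be tried locally. The sketch below runs the same query against an in-memory SQLite database from Python's standard library. SQLite is only a stand-in here, not Rockset, and the table layout, column types, and sample rows are assumptions based on the example above. An `ORDER BY` is added so the output order is deterministic.

```python
import sqlite3

# In-memory database as a local stand-in; this is not Rockset.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE posts (post_id INTEGER PRIMARY KEY, title TEXT, content TEXT);
    CREATE TABLE comments (comment_id INTEGER PRIMARY KEY, post_id INTEGER, text TEXT);
    """
)
conn.execute(
    "INSERT INTO posts VALUES (?, ?, ?)",
    (1, "Elasticsearch Modeling", "A post about data modeling in Elasticsearch"),
)
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?)",
    [(2, 1, "Great post!"), (3, 1, "I learned a lot from this.")],
)

# Same join as above, one row per comment on the post.
rows = conn.execute(
    """
    SELECT p.title, p.content, c.text
    FROM posts p
    JOIN comments c ON p.post_id = c.post_id
    WHERE p.post_id = 1
    ORDER BY c.comment_id
    """
).fetchall()

for title, content, text in rows:
    print(title, "|", text)
```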
If you have multiple data sets that need to be joined for your application, then Rockset is more straightforward and scalable than Elasticsearch. It also simplifies operations, as you do not need to remodel your data, manage updates, or run reindexing operations.
This blog provided an overview of the nested field types and nested queries, and parent-child relationships in Elasticsearch with the goal of helping you to determine the best data modeling approach for your workload.
Nested field types and queries are useful for one-to-one or one-to-many relationships where the relationship is maintained within a single document. This is generally considered the simpler and more scalable approach to relationship management.
The parent-child relationship model is better suited for one-to-many or many-to-many relationships but comes with increased complexity, especially as the related documents need to be kept on the same shard.
If one of the primary requirements of your application is modeling relationships, it may make sense to consider Rockset. Rockset simplifies data modeling and offers a more scalable approach to relationship management using SQL joins. You can compare and contrast the performance of Elasticsearch and Rockset by starting a free trial with $300 in credits today.