Elasticsearch is an open-source, distributed JSON-based search and analytics engine built using Apache Lucene to provide fast real-time search functionality. It is a NoSQL data store that is document-oriented, scalable, and schemaless by default. Elasticsearch is designed to work at scale with large data sets. As a search engine, it provides fast indexing and search capabilities that can be horizontally scaled across multiple nodes.
Shameless plug: Rockset is a real-time indexing database in the cloud. It automatically builds indexes that are optimized not just for search but also aggregations and joins, making it fast and easy for your applications to query data, regardless of where it comes from and what format it is in. But this post is about highlighting some workarounds, in case you really want to do SQL-style joins in Elasticsearch.
We live in a highly connected world where handling data relationships is important. Relational databases are good at handling relationships, but with constantly changing business requirements, the fixed schema of these databases results in scalability and performance issues. The use of NoSQL data stores is becoming increasingly popular due to their ability to tackle several challenges associated with traditional data handling approaches.
Enterprises are continually dealing with complex data structures where aggregations, joins, and filtering capabilities are required to analyze the data. With the explosion of unstructured data, there are a growing number of use cases requiring the joining of data from different sources for data analytics purposes.
While joins are primarily an SQL concept, they are equally important in the NoSQL world as well. SQL-style joins are not supported in Elasticsearch as first-class citizens. This article will discuss how to define relationships in Elasticsearch using various techniques such as denormalizing, application-side joins, nested documents, and parent-child relationships. It will also explore the use cases and challenges associated with each approach.
Because Elasticsearch is not a relational database, joins do not exist as a native functionality like in an SQL database. It focuses more on search efficiency as opposed to storage efficiency. The stored data is practically flattened out or denormalized to drive fast search use cases.
There are multiple ways to define relationships in Elasticsearch. Based on your use case, you can select one of the below techniques in Elasticsearch to model your data:
One-to-one relationships: Object mapping
One-to-many relationships: Nested documents and the parent-child model
Many-to-many relationships: Denormalizing and application-side joins
One-to-one object mappings are simple and will not be discussed much here. The remainder of this blog will cover the other two scenarios in more detail.
There are four common approaches to managing data in Elasticsearch:
Denormalization provides the best query search performance in Elasticsearch, since joining data sets at query time isn’t necessary. Each document is independent and contains all the required data, thus eliminating the need for expensive join operations.
With denormalization, the data is stored in a flattened structure at the time of indexing. Though this increases the document size and results in the storage of duplicate data in each document. Disk space is not an expensive commodity and thus little cause for concern.
While working with distributed systems, having to join data sets across the network can introduce significant latencies. You can avoid these expensive join operations by denormalizing data. Many-to-many relationships can be handled by data flattening.
Application-side joins can be used when there is a need to maintain the relationship between documents. The data is stored in separate indices, and join operations can be performed from the application side during query time. This does, however, entail running additional queries at search time from your application to join documents.
Application-side joins ensure that data remains normalized. Modifications are done in one place, and there is no need to constantly update your documents. Data redundancy is minimized with this approach. This method works well when there are fewer documents and data changes are less frequent.
The application needs to execute multiple queries to join documents at search time. If the data set has many consumers, you will need to execute the same set of queries multiple times, which can lead to performance issues. This approach, therefore, does not leverage the real power of Elasticsearch.
This approach results in complexity at the implementation level. It requires writing additional code at the application level to implement join operations to establish a relationship among documents.
The nested approach can be used if you need to maintain the relationship of each object in the array. Nested documents are internally stored as separate Lucene documents and can be joined at query time. They are index-time joins, where multiple Lucene documents are stored in a single block. From the application perspective, the block looks like a single Elasticsearch document. Querying is therefore relatively faster, since all the data resides in the same object. Nested documents deal with one-to-many relationships.
Creating nested documents is preferred when your documents contain arrays of objects. Figure 1 below shows how the nested type in Elasticsearch allows arrays of objects to be internally indexed as separate Lucene documents. Lucene has no concept of inner objects, hence it is interesting to see how Elasticsearch internally transforms the original document into flattened multi-valued fields.
One advantage of using nested queries is that it won’t do cross-object matches, hence unexpected match results are avoided. It is aware of object boundaries, making the searches more accurate.
Figure 1: Arrays of objects indexed internally as separate Lucene documents in Elasticsearch using nested approach
Parent-child relationships leverage the join datatype in order to completely separate objects with relationships into individual documents—parent and child. This enables you to store documents in a relational structure in separate Elasticsearch documents that can be updated separately.
Parent-child relationships are beneficial when the documents need to be updated often. This approach is therefore ideal for scenarios when the data changes frequently. Basically, you separate out the base document into multiple documents containing parent and child. This allows both the parent and child documents to be indexed/updated/deleted independently of one another.
To optimize Elasticsearch performance during indexing and searching, the general recommendation is to ensure that the document size is not large. You can leverage the parent-child model to break down your document into separate documents.
However, there are some challenges with implementing this. Parent and child documents need to be routed to the same shard so that joining them during query time will be in-memory and efficient. The parent ID needs to be used as the routing value for the child document. The _parent
field provides Elasticsearch with the ID and type of the parent document, which internally lets it route the child documents to the same shard as the parent document.
Elasticsearch allows you to search from complex JSON objects. This, however, requires a thorough understanding of the data structure to efficiently query from it. The parent-child model leverages multiple filters to simplify the search functionality:
has_child
queryReturns parent documents that have child documents matching the query.
has_parent
queryAccepts a parent and returns child documents that associated parents have matched.
inner_hits
queryFetches relevant children information from the has_child
query.
Figure 2 shows how you can use the parent-child model to demonstrate one-to-many relationships. The child documents can be added/removed/updated without impacting the parent. The same holds true for the parent document, which can be updated without reindexing the children.
Figure 2: Parent-child model for one-to-many relationships
Choosing the right Elasticsearch data modeling design is critical for application performance and maintainability. When designing your data model in Elasticsearch, it is important to note the various pros and cons of each of the four modeling methods discussed herein.
In this article, we explored how nested objects and parent-child relationships enable SQL-like join operations in Elasticsearch. You can also implement custom logic in your application to handle relationships with application-side joins. For use cases in which you need to join multiple data sets in Elasticsearch, you can ingest and load both these data sets into the Elasticsearch index to enable performant querying.
Out of the box, Elasticsearch does not have joins as in an SQL database. While there are potential workarounds for establishing relationships in your documents, it is important to be aware of the challenges each of these approaches presents.
When there is a need to combine multiple data sets for real-time analytics, a database that provides native SQL joins can handle this use case better. Like Elasticsearch, Rockset is used as an indexing layer on data from databases, event streams, and data lakes, permitting schemaless ingest from these sources. Unlike Elasticsearch, Rockset provides the ability to query with full-featured SQL, including joins, giving you greater flexibility in how you can use your data.
Also published here.