Querying Elasticsearch can be very confusing, especially when you are just starting to work with the engine. In this article, I would like to give you a jump-start and simplify the subject.
Our query is sent to the Elasticsearch “_search” API in the body of the request. Usually we would use one of the Elasticsearch client SDKs, depending on the language we want to use.
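To make the request shape concrete, here is a minimal Python sketch that builds (but does not send) such a request using only the standard library. The node address `http://localhost:9200` and the index name `my-index` are assumptions for illustration; in practice a client SDK would usually handle this for you.

```python
import json
from urllib.request import Request

# Hypothetical values: a local node and an index called "my-index".
ES_URL = "http://localhost:9200"
INDEX = "my-index"


def build_search_request(query_body: dict) -> Request:
    """Build (but do not send) a POST request to the index's _search endpoint."""
    return Request(
        url=f"{ES_URL}/{INDEX}/_search",
        data=json.dumps(query_body).encode("utf-8"),  # the query travels in the body
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request({"query": {"match_all": {}}})
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# Sending it would be urllib.request.urlopen(request), given a running cluster.
```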
Before we dive in, I would like to mention a few points about the Elasticsearch indexing and mapping processes.
We have two documents:
Doc_1 — “in the summer the quick brown fox jump over the lazy dog”
Doc_2 — “the quick brown fox jump over the lazy dog”
Both documents are indexed by Elasticsearch. The result of the indexing process is an inverted index, in which each token in the text is mapped to the documents that contain it.
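The idea of an inverted index can be sketched in a few lines of Python. This is only a toy model that splits on whitespace; it is not how Elasticsearch stores its index internally, and the function name is made up for illustration.

```python
from collections import defaultdict

docs = {
    "Doc_1": "in the summer the quick brown fox jump over the lazy dog",
    "Doc_2": "the quick brown fox jump over the lazy dog",
}


def build_inverted_index(documents):
    """Map every token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.split():  # naive whitespace tokenization
            index[token].add(doc_id)
    return index


inverted_index = build_inverted_index(docs)
for token in sorted(inverted_index):
    print(f"{token:8} -> {sorted(inverted_index[token])}")
```

Tokens such as “summer” point only to Doc_1, while tokens such as “fox” point to both documents.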
During the indexing process, the text is transformed by character filters, a tokenizer, and token filters.
Together, these three elements define an analyzer. Each index has an analyzer attached to it. Elasticsearch has built-in analyzers, and you can also build your own custom analyzer and attach it to your index.
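An analyzer chains character filters, a tokenizer, and token filters. The following Python sketch imitates that pipeline. It is a simplified stand-in, not one of Elasticsearch's built-in analyzers, and the stop-word list and the tag-stripping rule are assumptions chosen for the example.

```python
import re

STOP_WORDS = {"the", "in", "over"}  # hypothetical stop-word list


def char_filter(text):
    """Character filter: replace HTML-like tags with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenizer(text):
    """Tokenizer: split the text into word tokens."""
    return re.findall(r"\w+", text)


def token_filters(tokens):
    """Token filters: lowercase every token, then drop stop words."""
    lowered = [token.lower() for token in tokens]
    return [token for token in lowered if token not in STOP_WORDS]


def analyze(text):
    """Run the three stages in order, as an analyzer would."""
    return token_filters(tokenizer(char_filter(text)))


print(analyze("<b>The</b> quick BRown fox"))  # ['quick', 'brown', 'fox']
```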
According to the Elasticsearch documentation:
Mapping is the process of defining how a document, and the fields it contains, are stored and indexed. For instance, use mappings to define:
When you create a new index you have three options:
“Fields and mapping types do not need to be defined before being used. Thanks to dynamic mapping, new field names will be added automatically, just by indexing a document. New fields can be added both to the top-level mapping type, and to inner object and nested fields.” (Elasticsearch documentation)
Dynamic mapping rules:
Text fields can be mapped as:
Full-text: If the field is the body of an email or a product description, then the field should be mapped as full text. The text is tokenized based on the analyzer and you can search each word in the text individually.
Keyword: If you need to index structured content such as email addresses, hostnames, status codes, or tags, you should most likely use a keyword field. The string is treated as a single unit and the whole string is indexed. There is no option for partial matches.
Elasticsearch dynamic mapping maps text fields with both types, so you can search either way (exact value or partial match):
{
  "name": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword",
        "ignore_above": 256
      }
    }
  }
}
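With a mapping like this, a full-text query can target `name`, while an exact-value query can target the `name.keyword` sub-field. The sketch below builds both request bodies as Python dictionaries. The helper names and the searched values ("john", "John Smith") are hypothetical and serve only as illustration.

```python
import json


def match_query(field, text):
    """Full-text query body: the text is analyzed before the lookup."""
    return {"query": {"match": {field: text}}}


def term_query(field, value):
    """Term-level query body: the value is looked up exactly as given."""
    return {"query": {"term": {field: value}}}


# Partial match against the analyzed "text" field.
partial = match_query("name", "john")
# Exact match against the "keyword" sub-field.
exact = term_query("name.keyword", "John Smith")

print(json.dumps(partial))
print(json.dumps(exact))
```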
JSON doesn’t have a date data type, so dates in Elasticsearch can either be:
1. Strings containing formatted dates, e.g. “2015-01-01” or “2015/01/01 12:10:30”.
2. A long number representing milliseconds-since-the-epoch.
3. An integer representing seconds-since-the-epoch.
Internally, dates are converted to UTC (if the time-zone is specified) and stored as a long number representing milliseconds-since-the-epoch. (Official documentation)
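To illustrate what “milliseconds-since-the-epoch” means for the two example strings above, here is a small Python sketch. It assumes the strings are already in UTC and supports only the two formats shown; it mimics the idea of the conversion, not Elasticsearch's actual date parser.

```python
from datetime import datetime, timezone

# The two string layouts used in the examples above (assumed to be UTC).
FORMATS = ("%Y-%m-%d", "%Y/%m/%d %H:%M:%S")


def string_to_epoch_millis(text):
    """Parse a formatted date string and return milliseconds since the epoch."""
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue  # try the next known format
        return int(parsed.replace(tzinfo=timezone.utc).timestamp() * 1000)
    raise ValueError(f"unsupported date format: {text!r}")


def seconds_to_epoch_millis(seconds):
    """Convert seconds-since-the-epoch to milliseconds-since-the-epoch."""
    return seconds * 1000


print(string_to_epoch_millis("2015-01-01"))           # 1420070400000
print(string_to_epoch_millis("2015/01/01 12:10:30"))  # 1420114230000
print(seconds_to_epoch_millis(1420070400))            # 1420070400000
```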
You can define your custom date format:
{
  "mappings": {
    "properties": {
      "date": {
        "type": "date",
        "format": "yyyy-MM-dd"
      }
    }
  }
}
Elasticsearch supports other field types as well; the full list is in the official documentation.
Every query starts with a “query” clause:
{
  "query": {
  }
}
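When we query Elasticsearch we need to take into account two things:
When we write a query we can use two types of clauses: leaf query clauses, which look for a particular value in a particular field (for example match, term, or range), and compound query clauses, which wrap other leaf or compound clauses (for example bool).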
Before we start to write a compound query we need to:
First, let’s understand the concept of context in Elasticsearch.
In Elasticsearch we have two search contexts: query context and filter context.
In the query context, a query clause answers the question “How well does this document match this query clause?” Besides deciding whether or not the document matches, the query clause also calculates a relevance score in the _score meta-field. (Official documentation)
Query context is in effect whenever a query clause is passed to a query parameter. This could be the top-level query clause or, for example, the must and should clauses of the bool compound query. (The must_not clause is different: it is executed in filter context and does not contribute to the score.) The Elasticsearch documentation states for each clause whether or not it contributes to the final score.
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "street": "ditmas"
          }
        },
        {
          "match": {
            "street": "avenue"
          }
        }
      ]
    }
  }
}
In the example above, we have a must clause, which runs in query context. This means that the leaf queries inside it will affect the score of the matching documents. The theory behind the scoring algorithm is described in the Elasticsearch documentation.
The score is very helpful when you want to order your results by relevance.
In filter context, a query clause answers the question “Does this document match this query clause?” The answer is a simple Yes or No — no scores are calculated. Filter context is mostly used for filtering structured data, e.g.
1. Does this timestamp fall into the range 2015 to 2016?
2. Is the status field set to published? (Official documentation)
Filter context is in effect whenever a query clause is passed to a filter parameter. For example, the filter and must_not parameters of the bool compound query run in filter context. As with query context, you should check the documentation to see whether a given clause affects the scoring or not.
{
  "query": {
    "bool": {
      "filter": {
        "range": {
          "age": {
            "gte": 20,
            "lte": 30
          }
        }
      }
    }
  }
}
In the example above, we have a filter clause, which runs in filter context. This means that the leaf queries inside it will not affect the score of the matching documents.
Inside our search clause, we can combine query context and filter context with a compound query like bool. In that case, only the search terms that appear in the query context clauses affect the score of each document. If we only have a filter context, then all the documents will have a score of zero.
The decisions we made earlier will determine the layout of our query.
For example, this query uses query context and filter context together:
{
  "query": {
    "bool": {
      "must": {
        Leaf query clauses - affect the scoring of matching documents
      },
      "must_not": {
        Leaf query clauses - exclude matching documents, don't affect the scoring
      },
      "should": {
        Leaf query clauses - affect the scoring of matching documents
      },
      "filter": {
        Leaf query clauses - don't affect the scoring of matching documents
      }
    }
  }
}
Only the query clauses that appear inside the must and should clauses will affect the score of each document (they run in query context). The must_not clause only excludes documents and, like filter, does not contribute to the score. Elasticsearch takes a more-matches-is-better approach, which means that the scores from each matching must and should clause are added together to provide the final score. If we don’t need a score at all, we can use only the filter clause. For example, if we search over structured data or search for exact values like binary or dates, we will only use the filter context:
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": { "gender": "female" }
        },
        {
          "range": { "age": { "gte": 50 } }
        }
      ]
    }
  }
}
All the matching documents in the result of the query above will have a score of zero.
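The way query context and filter context combine can be imitated with a toy model in Python. This is only a sketch under strong assumptions: every matching clause scores a flat 1.0 (Elasticsearch computes real relevance scores), the sample people and helper names are invented, and the rule that a bool query without must or filter needs at least one matching should clause is ignored.

```python
people = [
    {"name": "Ann", "gender": "female", "age": 55, "bio": "runner and chess player"},
    {"name": "Bea", "gender": "female", "age": 62, "bio": "runner"},
    {"name": "Cal", "gender": "male", "age": 58, "bio": "chess player"},
    {"name": "Dee", "gender": "female", "age": 40, "bio": "chess player"},
]


def term(field, value):
    """Toy term clause: 1.0 on an exact match, otherwise 0.0."""
    return lambda doc: 1.0 if doc.get(field) == value else 0.0


def range_gte(field, minimum):
    """Toy range clause: 1.0 if the field value is >= minimum."""
    return lambda doc: 1.0 if doc.get(field, 0) >= minimum else 0.0


def contains(field, word):
    """Toy stand-in for a match clause: 1.0 if the word appears in the field."""
    return lambda doc: 1.0 if word in doc.get(field, "").split() else 0.0


def bool_search(docs, must=(), should=(), filters=()):
    """must and filters decide whether a document matches;
    only must and should add to the score."""
    hits = []
    for doc in docs:
        required = list(must) + list(filters)
        if not all(clause(doc) > 0 for clause in required):
            continue
        score = float(sum(c(doc) for c in must) + sum(c(doc) for c in should))
        hits.append((doc["name"], score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


filter_only = bool_search(people, filters=[term("gender", "female"), range_gte("age", 50)])
print(filter_only)  # [('Ann', 0.0), ('Bea', 0.0)]

scored = bool_search(
    people,
    must=[contains("bio", "runner")],
    should=[contains("bio", "chess")],
    filters=[term("gender", "female")],
)
print(scored)  # [('Ann', 2.0), ('Bea', 1.0)]
```

In the filter-only search every hit has a score of zero, while in the second search the must and should scores are added together.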
While building our outer layout, we decided what the building blocks of our query are. We also decided which fields will determine the score of our results. As you can see, Elasticsearch has a lot of options and we only covered the basics in this article. Each compound query can wrap other compound queries and so on. My advice to you is to try to keep it as simple as possible.
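One way to keep the outer layout simple is to assemble it with a small helper. The Python sketch below is a hypothetical convenience function, not part of any Elasticsearch client; it only builds the request body as a dictionary and omits the clauses that were not provided.

```python
import json


def bool_query(must=None, should=None, must_not=None, filters=None):
    """Assemble the outer layout of a bool query, skipping empty clauses."""
    clauses = {
        "must": must,          # query context: contributes to the score
        "should": should,      # query context: contributes to the score
        "must_not": must_not,  # filter context: excludes documents
        "filter": filters,     # filter context: no scoring
    }
    return {"query": {"bool": {k: v for k, v in clauses.items() if v}}}


query = bool_query(
    must=[{"match": {"street": "ditmas"}}],
    filters=[{"range": {"age": {"gte": 20, "lte": 30}}}],
)
print(json.dumps(query, indent=2))
```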
Now it is time to write our inner/leaf search query (what comes inside our container clauses).
Here we also have decisions to make 🙂
For every field that we search on, we need to:
1. Decide whether the field is relevant to the score of the documents.
   - If it is, search it in a query clause.
   - If it is not, search it in a filter clause (remember that a filter has to be nested inside a compound clause such as bool).
2. Check the type of the field and how it was mapped.
   - If the field was mapped as keyword, then we can only search it in the exact form in which it was indexed (not tokenized, with uppercase and lowercase letters preserved, and so on).
   Let’s say, for example, that we have indexed a document with a “notes” field containing the text “The quick BRown fox”.
   - If the notes field was mapped as keyword, then the inverted index would contain the text “The quick BRown fox” mapped to that document. Searching for exactly “The quick BRown fox” will match that document.
   - If the notes field was mapped as full text, then the inverted index will contain the tokens [the, quick, brown, fox], each connected separately to the document. Searching for any of these tokens (or their synonyms, if the analyzer is configured with synonyms) will match that document.
3. Decide how our text will be sent to the search engine.
When we send a query to the Elasticsearch engine, we have two options:
1. Send it as is
For that choice we use the term-level queries. For example, if you search for the phrase “Star Trek”, then the query engine will check the inverted index for “Star Trek”.
2. Send it analyzed
For that choice, we use the full-text queries. The searched text will pass through the same analyzer that the indexed text passed through during indexing (a different analyzer can also be specified at search time). It will be tokenized and filtered. For example, if you search for the phrase “Star Trek”, then the query engine will check the inverted index for [“star”, “trek”] (depending on the analyzer you chose).
Note: If the field was originally mapped as a keyword, then you will have to send the exact text as it was indexed in order to get results.
Most of the time we will want the searched text to be analyzed before it is sent to the search engine, because it gives better results. But sometimes we want to search for an exact word or sentence, usually in data like numbers, dates, and enums.
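The difference between the two options can be imitated with a toy lookup in Python. This sketch assumes a very simple analyzer (lowercase and split on whitespace) and models the index as a plain set; the function names are hypothetical and real Elasticsearch behaviour depends on the analyzer and mapping you use.

```python
def analyze(text):
    """Toy analyzer: lowercase, then split on whitespace."""
    return text.lower().split()


def index_as_text(value):
    """Full-text mapping: the index holds the analyzed tokens."""
    return set(analyze(value))


def index_as_keyword(value):
    """Keyword mapping: the index holds the whole string, untouched."""
    return {value}


def term_lookup(index, search_text):
    """Term-level query: look up the search text exactly as sent."""
    return search_text in index


def match_lookup(index, search_text):
    """Full-text query: analyze the search text, then match any token."""
    return any(token in index for token in analyze(search_text))


text_index = index_as_text("Star Trek")        # {"star", "trek"}
keyword_index = index_as_keyword("Star Trek")  # {"Star Trek"}

print(term_lookup(text_index, "Star Trek"))     # False: no "Star Trek" token
print(match_lookup(text_index, "Star Trek"))    # True: "star" and "trek" exist
print(term_lookup(keyword_index, "Star Trek"))  # True: exact string was indexed
print(term_lookup(keyword_index, "star trek"))  # False: case differs
```

The first line shows why a term-level query against a full-text field often returns nothing: the index holds the analyzed tokens, not the original string.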
Full-text query example
{
  "query": {
    "match": {
      "mail_body": "Jeff BRidges"
    }
  }
}
Term query example
{
  "query": {
    "term": {
      "mail_from": "[email protected]"
    }
  }
}
Compound query: final detailed example
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "mail_body": "Jeff BRidges"
        }
      },
      "filter": {
        "term": {
          "mail_from": "[email protected]"
        }
      }
    }
  }
}
query: the main query container.
bool: the compound query container.
must: a query context clause; each leaf query inside it will contribute to the score of the matching documents.
match: a full-text query, which means that the text “Jeff BRidges” will pass through the analyzer and be transformed to [“jeff”, “bridges”]. Make sure you use this option only if the “mail_body” field was mapped as a full-text field.
filter: a filter context clause; each leaf query inside it will not contribute to the score of the matching documents, and the clauses are considered for caching.
term: a term-level query; the text “[email protected]” will not pass through the analyzer and will be sent as is to the search engine.
The Elasticsearch query DSL is not the simplest thing to use, but once you know how to use it, it can be a powerful tool. In this article, I tried to give you a jump-start for querying Elasticsearch, and I encourage you to dig deeper into the Elasticsearch documentation.
Once you understand all the concepts we discussed in the article, you'll find it easier to walk through the Elasticsearch documentation and find all the solutions you need.