royibeni

Senior Full Stack Developer

How To Master Elasticsearch Query DSL

Querying Elasticsearch can be confusing, especially when you are just starting to work with the engine. In this article, I would like to give you a jump-start and simplify the subject.
A query is sent to the Elasticsearch “_search” API in the body of the request. Usually we use one of the Elasticsearch client SDKs, depending on the language we work in.
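To make the request shape concrete, here is a minimal sketch that builds (but does not send) such a request using only the Python standard library. The host `http://localhost:9200` and the index name `my-index` are assumptions for illustration; a real project would more likely use an official client.

```python
import json
import urllib.request

# Hypothetical values: adjust the host and index name to your own cluster.
ES_URL = "http://localhost:9200"
INDEX = "my-index"


def build_search_request(query_body: dict) -> urllib.request.Request:
    """Build a POST request that carries the query in the body of a _search call."""
    return urllib.request.Request(
        url=f"{ES_URL}/{INDEX}/_search",
        data=json.dumps(query_body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request({"query": {"match_all": {}}})
print(request.get_method(), request.full_url)

# To run it against a live cluster you could do something like:
# with urllib.request.urlopen(request) as response:
#     hits = json.load(response)["hits"]["hits"]
```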
Before we dive in, I would like to mention a few points about the Elasticsearch indexing and mapping processes.

Indexing process

We have two documents:
Doc_1 — “in the summer the quick brown fox jump over the lazy dog” 
Doc_2 — “the quick brown fox jump over the lazy dog”
Both documents are indexed by Elasticsearch. The result of the indexing process is an inverted index, in which each token in the text is mapped to the documents that contain it.
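As a rough illustration of the idea (not of how Elasticsearch stores data internally), the two documents can be turned into a tiny inverted index in plain Python. Splitting on whitespace is a simplifying assumption here.

```python
from collections import defaultdict

docs = {
    "Doc_1": "in the summer the quick brown fox jump over the lazy dog",
    "Doc_2": "the quick brown fox jump over the lazy dog",
}


def build_inverted_index(documents):
    """Map every token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.split():  # naive whitespace tokenizer
            index[token].add(doc_id)
    return index


inverted_index = build_inverted_index(docs)
print(sorted(inverted_index["summer"]))  # ['Doc_1']
print(sorted(inverted_index["fox"]))     # ['Doc_1', 'Doc_2']
```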
During the indexing process, the text is transformed in three stages:
  1. Character filters: zero or more character filters that clean up the text, for example by stripping unwanted characters such as HTML tags
  2. Tokenizer: a single tokenizer that breaks the string down into simple words (tokens)
  3. Token filters: zero or more token filters that transform the tokens, such as the lowercase token filter, the stop words token filter, the synonym filter, etc.
Together, these three elements define an analyzer (character filters + tokenizer + token filters). Text fields are analyzed with the index's default analyzer unless the mapping specifies another one. Elasticsearch has built-in analyzers, and you can also build your own custom analyzer and attach it to your index.
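The pipeline can be imitated in a few lines of Python. This is only a toy model of the three stages; the regular expressions and the stop-word list are assumptions chosen for the example and do not reproduce any built-in Elasticsearch analyzer.

```python
import re

# Illustrative stop-word list, not the one Elasticsearch ships with.
STOP_WORDS = {"the", "in", "over"}


def strip_html(text):
    """Character filter: replace HTML tags with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Tokenizer: break the string into word tokens."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]


def remove_stop_words(tokens):
    """Token filter: drop tokens that appear in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


def analyze(text):
    """Analyzer: character filter, then tokenizer, then token filters in order."""
    tokens = tokenize(strip_html(text))
    for token_filter in (lowercase, remove_stop_words):
        tokens = token_filter(tokens)
    return tokens


print(analyze("<p>The quick <b>BRown</b> fox</p>"))  # ['quick', 'brown', 'fox']
```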

Mapping

According to the Elasticsearch documentation:
Mapping is the process of defining how a document, and the fields it contains, are stored and indexed. For instance, use mappings to define:
  1. Which string fields should be treated as full-text fields.
  2. Which fields contain numbers, dates, or geolocations.
  3. The format of date values.
  4. Custom rules to control the mapping for dynamically added fields.
When you create a new index, you have three options:
  1. Define the mapping of each field on your own
  2. Use dynamic mapping and let Elasticsearch “guess” the mapping
  3. Use both: define the important fields and let the Elasticsearch engine handle the rest of the fields
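A sketch of the third option, written as a Python dictionary that could serve as the body of an index-creation request. The field names `title`, `status`, and `created_at` are hypothetical; `dynamic` is left enabled so that any other field is mapped automatically.

```python
import json

# Hypothetical index definition: three explicitly mapped fields,
# everything else is left to dynamic mapping.
create_index_body = {
    "mappings": {
        "dynamic": True,
        "properties": {
            "title": {"type": "text"},
            "status": {"type": "keyword"},
            "created_at": {"type": "date", "format": "yyyy-MM-dd"},
        },
    }
}

print(json.dumps(create_index_body, indent=2))
```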
The Elasticsearch documentation describes dynamic mapping as follows: “Fields and mapping types do not need to be defined before being used. Thanks to dynamic mapping, new field names will be added automatically, just by indexing a document. New fields can be added both to the top-level mapping type, and to inner object and nested fields.”
Dynamic mapping rules:

String Fields

String fields can be mapped as:
  1. Full-text
    If the field is the body of an email or a product description, it should be mapped as full text (the text type). The text is tokenized by the analyzer, and you can search for each word in the text individually.
  2. Keyword
    If you need to index structured content such as email addresses, hostnames, status codes, or tags, you should most likely use a keyword field. The string is treated as a single unit and indexed whole, so a full-text search for one of its words will not match.
By default, Elasticsearch dynamic mapping maps string fields with both types (a text field with a keyword sub-field), so you can search either way (exact value or individual words):
{
    "name": {
        "type": "text",
        "fields": {
            "keyword": {
                "type": "keyword",
                "ignore_above": 256
            }
        }
    }
}
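With a mapping like the one above, the same value can be reached through two paths: `name` for analyzed full-text search and `name.keyword` for exact matching. The helper names below are made up for this sketch.

```python
def full_text_search(words):
    """Search the analyzed text field: individual words can match."""
    return {"query": {"match": {"name": words}}}


def exact_search(value):
    """Search the keyword sub-field: the whole string must match exactly."""
    return {"query": {"term": {"name.keyword": value}}}


print(full_text_search("brown fox"))
print(exact_search("The quick BRown fox"))
```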

Date fields

JSON doesn’t have a date data type, so, according to the official documentation, dates in Elasticsearch can be any of the following:
  1. Strings containing formatted dates, e.g. “2015-01-01” or “2015/01/01 12:10:30”.
  2. A long number representing milliseconds-since-the-epoch.
  3. An integer representing seconds-since-the-epoch.
Internally, dates are converted to UTC (if the time-zone is specified) and stored as a long number representing milliseconds-since-the-epoch.
You can define your custom date format:
{
  "mappings": {
    "properties": {
      "date": {
        "type":   "date",
        "format": "yyyy-MM-dd"
      }
    }
  }
}
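To get a feel for what “milliseconds-since-the-epoch” means for a formatted date string, here is a small Python sketch. It assumes the string is already in UTC; the function name is hypothetical, and the Python format codes play the role of the `yyyy-MM-dd` pattern in the mapping.

```python
from datetime import datetime, timezone


def to_epoch_millis(date_string, fmt="%Y-%m-%d"):
    """Parse a date string (assumed to be UTC) into milliseconds since the epoch."""
    parsed = datetime.strptime(date_string, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp() * 1000)


millis = to_epoch_millis("2015-01-01")
print(millis)          # 1420070400000 (milliseconds-since-the-epoch)
print(millis // 1000)  # 1420070400 (seconds-since-the-epoch)
```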
Elasticsearch supports other field types as well; you can find them in the field data types section of the official documentation.

Build our query

Every query starts with a “query” clause:
{
    "query": {
         
    }
}
When we query Elasticsearch, we need to take two things into account:
  1. All queries run against our inverted index. The analyzer (built-in or custom) that we chose for our index affects how our query clause is matched (lowercasing, stemming, stop-word removal, etc.).
  2. The mapping configuration of each field can affect our query. For example:
  • Text field: is our field configured as full-text or keyword?
  • Date: which date format did we choose for our field?
  • Number: is our field type integer, long, or float?
When we write a query, we can use two types of clauses:
  1. Compound query clauses: these are our wrapper clauses; they can combine leaf queries and nested compound queries.
  2. Leaf query clauses: these look for a particular value in a particular field (field name and value).
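The relationship between the two clause types can be sketched with a few small Python helpers. The helper names and the `title` and `status` fields are assumptions for illustration; the helpers only assemble the JSON structure and do not talk to Elasticsearch.

```python
def match(field, text):
    """Leaf clause: full-text search on a single field."""
    return {"match": {field: text}}


def term(field, value):
    """Leaf clause: exact-value search on a single field."""
    return {"term": {field: value}}


def bool_query(**occurrences):
    """Compound clause: wraps leaf clauses and other compound clauses."""
    allowed = {"must", "should", "must_not", "filter"}
    unknown = set(occurrences) - allowed
    if unknown:
        raise ValueError(f"unsupported bool occurrence(s): {sorted(unknown)}")
    return {"bool": occurrences}


# A compound query that contains a leaf query and a nested compound query.
query = {
    "query": bool_query(
        must=[match("title", "quick fox")],
        filter=[
            bool_query(should=[term("status", "published"), term("status", "draft")])
        ],
    )
}
print(query)
```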

Compound query clauses

Before we start to write a compound query, we need to decide:
  1. Do we need a score for each document? The score tells us the relevance of each document relative to the other results.
  2. Which fields do we need to query, and which of them should control the score of the document?
First, let’s understand the concept of context in Elasticsearch.
In Elasticsearch we have two search contexts:

Query context

In the query context, a query clause answers the question “How well does this document match this query clause?” Besides deciding whether or not the document matches, the query clause also calculates a relevance score in the _score meta-field. (Official documentation)
Query context is in effect whenever a query clause is passed to a query parameter. This could be the top-level query clause or, for example, the must and should clauses of the bool compound query.
The Elasticsearch documentation for each clause states whether or not it contributes to the final score.
{
  "query": {
        "bool": {
            "must": [
                {
                    "match": {
                        "street": "ditmas"
                    }
                },
                {
                    "match": {
                        "street": "avenue"
                    }
                }
            ]
        }
    }
}
In the example above, we have a must clause. It runs in query context, which means that the leaf queries inside it affect the score of the matching documents.
The official documentation covers the theory behind the scoring algorithm.
The score is very helpful when you want to order your results by relevance.
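Each hit in a search response carries its `_score`. The sketch below reads the scores from a hand-written sample response; the ids, scores, and street values are invented for illustration, and only the `hits.hits[]._score` layout reflects the real response format. Elasticsearch already returns hits ordered by score by default, so the explicit sort is just there to make the ordering visible.

```python
# Hand-written sample in the shape of a _search response (values are made up).
sample_response = {
    "hits": {
        "hits": [
            {"_id": "1", "_score": 1.2, "_source": {"street": "ditmas avenue"}},
            {"_id": "2", "_score": 2.5, "_source": {"street": "ditmas avenue west"}},
        ]
    }
}


def by_relevance(response):
    """Return (id, score) pairs, highest score first."""
    hits = response["hits"]["hits"]
    ranked = sorted(hits, key=lambda hit: hit["_score"], reverse=True)
    return [(hit["_id"], hit["_score"]) for hit in ranked]


print(by_relevance(sample_response))  # [('2', 2.5), ('1', 1.2)]
```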

Filter context

In filter context, a query clause answers the question “Does this document match this query clause?” The answer is a simple Yes or No; no scores are calculated. Filter context is mostly used for filtering structured data, e.g.
  1. Does this timestamp fall into the range 2015 to 2016?
  2. Is the status field set to published?
(Official documentation)
Filter context is in effect whenever a query clause is passed to a filter parameter, such as the filter or must_not parameters of the bool compound query.
As with query context, check the documentation of each clause to see whether it affects scoring.
{
  "query": {
    "bool": {
      "filter": {
        "range": {
          "age": {
              "gte": 20,
              "lte": 30
          }
        }
      }
    }
  }
}
In the example above, we have the filter clause, which runs in filter context. This means that the leaf queries inside it will not affect the score of the matching documents.
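Conceptually, a filter is just a yes/no test per document. The toy Python version below mimics the age range filter on a made-up list of people; it illustrates the idea and says nothing about how Elasticsearch evaluates or caches filters.

```python
# Made-up documents for illustration.
people = [
    {"name": "Ann", "age": 25},
    {"name": "Bob", "age": 41},
    {"name": "Cid", "age": 30},
]


def in_age_range(doc, gte, lte):
    """Yes/no answer: does the document's age fall inside the inclusive range?"""
    return gte <= doc["age"] <= lte


matches = [person["name"] for person in people if in_age_range(person, 20, 30)]
print(matches)  # ['Ann', 'Cid']
```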
Inside our search clause, we can combine query and filter context with a compound query like bool. In that case, only the search terms that appear in the query context clauses affect the score of each document. If we only have filter context, then all the documents will have a score of zero.
The decisions we made earlier determine the layout of our query.
For example, this query uses query context and filter context together:
{
  "query": {
    "bool": {
      "must": {
          Leaf query clauses - affect the scoring of matching documents
      },
      "must_not": {
          Leaf query clauses - exclude documents, don't affect the scoring (filter context)
      },
      "should": {
          Leaf query clauses - affect the scoring of matching documents
      },
      "filter": {
          Leaf query clauses - don't affect the scoring of matching documents
      }
    }
  }
}
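A concrete version of this layout might look like the following Python dictionary. It reuses field names from the earlier examples, but combining them in one index is an assumption made for illustration, as is the `status` value. Note that, per the Elasticsearch documentation for the bool query, `must_not` runs in filter context just like `filter`, so only `must` and `should` contribute to the score.

```python
layout_example = {
    "query": {
        "bool": {
            # Query context: must match, contributes to the score.
            "must": [{"match": {"street": "ditmas"}}],
            # Query context: optional here, raises the score when it matches.
            "should": [{"match": {"street": "avenue"}}],
            # Filter context: excludes documents, no effect on the score.
            "must_not": [{"term": {"status": "draft"}}],
            # Filter context: must match, no effect on the score.
            "filter": [{"range": {"age": {"gte": 20, "lte": 30}}}],
        }
    }
}

SCORING_CLAUSES = ("must", "should")
scoring = [name for name in layout_example["query"]["bool"] if name in SCORING_CLAUSES]
print(scoring)  # ['must', 'should']
```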
Only the query clauses that appear inside the must and should clauses affect the score of each document (they run in query context). The must_not clause only excludes documents; like filter, it runs in filter context.
Elasticsearch takes a more-matches-is-better approach, which means that the scores from the matching must and should clauses are added together to provide the final score.
If we don’t need a score at all, we can use only the filter clause. For example, if we search over structured data or search for exact values such as binary values or dates, we only use filter context:
{
  "query": {
    "bool": {
      "filter": [
          {
            "term": { "gender": "female" }
          },
          {
            "range": { "age": {"gte": 50} }
          }
      ]
    }
  }
}
All the matching documents in the result of the query above will have a score of zero.

Leaf query clauses

While building our outer layout, we decided what the building blocks of our query are. We also decided which fields determine the score of our results. As you can see, Elasticsearch has a lot of options, and we have only covered the basics in this article. Each compound query can wrap other compound queries, and so on. My advice is to keep it as simple as possible.
Now it is time to write our inner (leaf) search queries, which go inside our container clauses.
Here we also have decisions to make 🙂
For every field that we search on, we need to:
Decide whether this field is relevant to the score of the documents.
  • Yes: put it inside a query context clause (such as must or should)
  • No: put it under a filter clause (remember that a filter has to be nested inside a compound query such as bool)
Check the type of the field and how it was mapped.
  • Querying text fields, for example, is tricky. If the text field was mapped as a keyword, then we can only search for it exactly as it was indexed (not tokenized, with the same uppercase/lowercase letters, etc.).
Let’s say, for example, we have indexed a document with a “notes” field that contains the text “The quick BRown fox”.
  • If the notes field was mapped as keyword, then the inverted index contains the whole text “The quick BRown fox” mapped to that document. Only a search for the exact text “The quick BRown fox” will match that document.
  • If the notes field was mapped as full text, then the inverted index contains the separate tokens [the, quick, brown, fox], each connected to the document. Searching for any of these tokens (or their synonyms, if the analyzer has a synonym filter) will match that document.
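The difference between the two mappings can be imitated in plain Python. The sketch assumes a very simple analyzer (whitespace split plus lowercasing), which is a stand-in for whatever analyzer the field really uses.

```python
text = "The quick BRown fox"

# Keyword mapping: the whole string is stored as one term, unchanged.
keyword_terms = {text}

# Full-text mapping: the string is analyzed into separate lowercase tokens.
text_terms = {token.lower() for token in text.split()}

print("The quick BRown fox" in keyword_terms)  # True: exact text matches
print("brown" in keyword_terms)                # False: no partial match
print("brown" in text_terms)                   # True: single token matches
```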
Decide how our text will be sent to the search engine.
When we send a query to the Elasticsearch engine, we have two options:
1. Send it as is
For this option we use the term-level queries. For example, if you search for the phrase “Star Trek”, the query engine will check the inverted index for the single term “Star Trek”.
2. Send it analyzed
For this option we use the full-text queries. The searched text passes through the same analyzer that the indexed text passed through during indexing (we can also specify a different analyzer as a parameter of the query). It is tokenized and filtered. For example, if you search for the phrase “Star Trek”, the query engine will check the inverted index for [“star”, “trek”] (depending on the analyzer you chose).
Note: If the field was mapped as a keyword, you have to send the exact text as it was indexed to get results.
Most of the time we want the searched text to be analyzed before it is sent to the search engine, because it gives better results. But sometimes we want to search for an exact word or sentence, usually in data like numbers, dates, and enums.
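The two options can be simulated against a toy index of a full-text field. The indexed title “Star Trek Beyond” and the simple analyzer are assumptions for this sketch; the match simulation uses “any token matches”, which mirrors the default OR behaviour of the match query.

```python
def analyze(text):
    """Stand-in analyzer: whitespace tokenizer plus lowercase filter."""
    return [token.lower() for token in text.split()]


# Tokens stored for a full-text field that contains a hypothetical title.
indexed_tokens = set(analyze("Star Trek Beyond"))


def term_lookup(search_text):
    """Term-level behaviour: the text is looked up exactly as it was sent."""
    return search_text in indexed_tokens


def match_lookup(search_text):
    """Full-text behaviour: the text is analyzed first, then any token may match."""
    return any(token in indexed_tokens for token in analyze(search_text))


print(term_lookup("Star Trek"))   # False: no single token equals "Star Trek"
print(match_lookup("Star Trek"))  # True: "star" and "trek" are both indexed
print(term_lookup("star"))        # True: matches the lowercase token exactly
```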
Full-text query example

{
  "query": {
    "match": {
      "mail_body": "Jeff BRidges"
    }
  }
}
Term query example
{
  "query": {
    "term": {
      "mail_from": "emma@somemail.com"
    }
  }
}
Compound query: final detailed example
{
  "query": {
    "bool": {
      "must": {
        "match": {
            "mail_body": "Jeff BRidges"
        }
      },
      "filter": {
        "term": {
            "mail_from": "emma@somemail.com"
        }
      }
    }
  }
}
  • query: the main query container
  • bool: a compound query container
  • must: this is a query context clause; each leaf query inside it contributes to the score of the matching documents
  • match: this is a full-text query, which means that the text “Jeff BRidges” passes through the analyzer and is transformed into [“jeff”, “bridges”]. Use this option only if the “mail_body” field was mapped as a full-text field.
  • filter: this is a filter context clause; the leaf queries inside it do not contribute to the score of the matching documents, and the clauses are considered for caching.
  • term: this is a term-level query; the text “emma@somemail.com” does not pass through the analyzer and is sent to the search engine as is.
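In application code, a query like this is usually assembled from user input. A minimal sketch, assuming a hypothetical helper named `build_mail_query` and the same `mail_body` and `mail_from` fields as above:

```python
import json


def build_mail_query(body_text, sender):
    """Score by the mail body (query context) and filter by sender (filter context)."""
    return {
        "query": {
            "bool": {
                "must": {"match": {"mail_body": body_text}},
                "filter": {"term": {"mail_from": sender}},
            }
        }
    }


mail_query = build_mail_query("Jeff BRidges", "emma@somemail.com")
print(json.dumps(mail_query, indent=2))
```

The printed JSON is the same request body as the final example above and can be sent to the “_search” API unchanged.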

Final Thoughts

The Elasticsearch query DSL is not the simplest thing to use, but once you know how to use it, it can be a powerful tool. In this article, I tried to give you a jump-start for querying Elasticsearch, and I encourage you to dig deeper into the Elasticsearch documentation.

Next Steps

Once you understand all the concepts we discussed in the article, you'll find it easier to walk through the Elasticsearch documentation and find all the solutions you need.
