Filters
Many applications need to give users the power to customize queries in ways that complement what search queries alone can do. In this chapter you are going to learn about filtering, a technique that makes it possible to specify that a search query is executed only on the subset of the documents contained in an index that satisfy a given condition.
Introduction to Boolean Queries
Before you can implement filters you have to understand how compound queries are implemented in Elasticsearch.
A compound query allows an application to combine two or more individual queries, so that they execute together, and if appropriate, return a combined set of results. The standard way to create compound queries in Elasticsearch is to use a Boolean query.
A boolean query acts as a wrapper for two or more individual queries or clauses. There are four different ways to combine queries:
bool.must: the clause must match. If multiple clauses are given, all must match (similar to an AND logical operation).
bool.should: when used without must, at least one clause should match (similar to an OR logical operation). When combined with must, each matching clause boosts the relevance score of the document.
bool.filter: only documents that match the clause(s) are considered search result candidates.
bool.must_not: only documents that do not match the clause(s) are considered search result candidates.
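To make the four clause types more concrete, here is a minimal sketch of a boolean query that uses all of them at once, written as a Python dictionary. The search words and the category value are placeholders chosen for illustration, and this query is not part of the tutorial application.

```python
# Illustrative only: a bool query that combines all four clause types.
# The search words and the category value are placeholders.
example_query = {
    'bool': {
        # must: every clause here has to match, and contributes to the score
        'must': [
            {'match': {'content': 'vacation'}},
        ],
        # should: optional because must is present; a match boosts the score
        'should': [
            {'match': {'name': 'policy'}},
        ],
        # filter: has to match, but does not affect the score
        'filter': [
            {'term': {'category.keyword': {'value': 'sharepoint'}}},
        ],
        # must_not: documents that match this clause are excluded
        'must_not': [
            {'match': {'summary': 'draft'}},
        ],
    }
}

print(sorted(example_query['bool']))
```

Each clause accepts a list, so more conditions can be added to any of the four sections without changing the overall shape of the query.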
As you can probably guess from the above, boolean queries involve a fair amount of complexity and can be used in a variety of ways. In this chapter you are going to learn how to combine the multi-match full-text search clause implemented in the previous chapters with a filter that restricts results to one category of documents. Recall that the dataset used with this tutorial includes a category field that can be set to sharepoint, teams or github.
Adding a Filter to a Query
The multi-match query that is currently implemented in the tutorial application uses the following structure:
{
    'multi_match': {
        'query': "query text here",
        'fields': ['name', 'summary', 'content'],
    }
}
To add a filter that restricts this search to a specific category, the query must be expanded as follows:
{
    'bool': {
        'must': [{
            'multi_match': {
                'query': "query text here",
                'fields': ['name', 'summary', 'content'],
            }
        }],
        'filter': [{
            'term': {
                'category.keyword': {
                    'value': "category to filter"
                }
            }
        }]
    }
}
Let's look at the new components in this query in detail.
First of all, the multi_match query has been moved inside a bool.must clause. The bool.must clause is usually the place where the base query is defined. Note that must accepts a list of queries to search for, so this allows multiple base-level queries to be combined when desired.
The filtering is implemented in a bool.filter section, using a new query type, the term query. Using a match or multi_match query for a filter is not a good idea, because these are full-text search queries. For the purpose of filtering, the query must return an absolute true or false answer for each document and not a relevance score like the match queries do.
The term query performs an exact search for a value in a given field. This type of query is useful to search for identifiers, labels, tags, or as in this case, categories.
This query does not work well with fields that are indexed for full-text search. String fields are assigned a default type of text, and have their contents analyzed and separated into individual words before they are indexed. Elasticsearch assigns string fields a secondary type of keyword, which indexes the field contents as a whole, making them more appropriate for filtering with the term query. By using a field name of category.keyword in the filter portion of the query, the keyword typed variant of the field is used instead of the default text one.
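For reference, the sketch below shows roughly what Elasticsearch's default dynamic mapping produces for a string field, written as a Python dictionary. Whether the tutorial's index uses exactly this mapping is an assumption (it holds when the index was created without an explicit mapping for the field), and subfield_type() is a hypothetical helper added only to make the structure easy to inspect.

```python
# What dynamic mapping typically generates for a string field such as category.
# This is an assumption about the index; inspect the real mapping to confirm.
category_mapping = {
    'type': 'text',                # analyzed, used by match / multi_match
    'fields': {
        'keyword': {               # addressed in queries as category.keyword
            'type': 'keyword',     # indexed as a single, unanalyzed value
            'ignore_above': 256,   # longer strings are not indexed here
        }
    }
}


def subfield_type(mapping, name):
    """Return the type of a named sub-field, or None if it does not exist."""
    return mapping.get('fields', {}).get(name, {}).get('type')


print(subfield_type(category_mapping, 'keyword'))
```

The part of the field name after the dot selects the sub-field, which is why the filter refers to category.keyword rather than category.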
Specifying a Filter
Before the filtered query can be implemented, it is necessary to add a way for end users to enter a desired filter. The solution implemented in this tutorial will look for a category:<category-name> pattern in the text of the search query. Let's add a function called extract_filters() to app.py to look for filter expressions:
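import re  # required by the filter regular expression


def extract_filters(query):
    filters = []

    # look for a category:<category-name> expression in the query text
    filter_regex = r'category:([^\s]+)\s*'
    m = re.search(filter_regex, query)
    if m:
        filters.append({
            'term': {
                'category.keyword': {
                    'value': m.group(1)
                }
            }
        })
        # remove the filter expression from the search text
        query = re.sub(filter_regex, '', query).strip()

    return {'filter': filters}, query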
The function accepts the query entered by the user and returns a tuple with the filters that were found in the query, and the modified query after the filters were removed. To look for the filter pattern it uses a regular expression. The function is designed to be expanded with additional filters.
When a filter is found, the filters list is extended with a corresponding filter expression, which in this case is based on the term query, as discussed above.
To better understand how this function works, start a Python session (make sure the virtual environment is activated first) and run the following code:
from app import extract_filters
extract_filters('this is the search text category:sharepoint')
The returned tuple from the function should be:
({'filter': [{'term': {'category.keyword': {'value': 'sharepoint'}}}]}, 'this is the search text')
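A few more inputs help show how the function behaves at the edges. The snippet below repeats the logic of extract_filters() in compact form so that it can run on its own, and the sample queries are made up for illustration.

```python
import re


def extract_filters(query):
    # same logic as the function above, repeated so this snippet is standalone
    filters = []
    filter_regex = r'category:([^\s]+)\s*'
    m = re.search(filter_regex, query)
    if m:
        filters.append({'term': {'category.keyword': {'value': m.group(1)}}})
        query = re.sub(filter_regex, '', query).strip()
    return {'filter': filters}, query


# filter in the middle of the text
print(extract_filters('work category:teams from home'))
# no filter at all: the filter list is empty and the text is unchanged
print(extract_filters('work from home'))
# only a filter: the remaining search text is an empty string
print(extract_filters('category:github'))
# repeated filters: only the first is used, but all are removed from the text
print(extract_filters('category:teams category:github notes'))
```

The last case follows from how the two calls differ: re.search() stops at the first match, while re.sub() replaces every match.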
Implementing the Filtered Search
What remains to do is to change the handle_search() function to send an updated query that combines the full-text search expression with a filter, if one is given by the user. Below is the new version of this function:
@app.post('/')
def handle_search():
    query = request.form.get('query', '')
    filters, parsed_query = extract_filters(query)
    from_ = request.form.get('from_', type=int, default=0)

    results = es.search(
        query={
            'bool': {
                'must': {
                    'multi_match': {
                        'query': parsed_query,
                        'fields': ['name', 'summary', 'content'],
                    }
                },
                **filters
            }
        },
        size=5,
        from_=from_
    )
    return render_template('index.html', results=results['hits']['hits'],
                           query=query, from_=from_,
                           total=results['hits']['total']['value'])
The query has now been changed to send a bool expression, and the search expression was moved inside a must section under it. The extract_filters() function returns the filter portion of the query in the form it needs to be sent to Elasticsearch, so it is inserted in the query dictionary also under the top-level bool key.
Try a search query such as work from home category:sharepoint to see how only documents from the given category are returned.
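The ** unpacking used in the function merges the dictionary returned by extract_filters() into the bool dictionary. The sketch below builds the same query body outside of the Flask handler, so that it can be printed and inspected without a running Elasticsearch instance. Note that build_query() is a hypothetical helper and is not part of the tutorial application.

```python
def build_query(parsed_query, filters):
    # hypothetical helper: assembles the same body that handle_search() sends
    return {
        'bool': {
            'must': {
                'multi_match': {
                    'query': parsed_query,
                    'fields': ['name', 'summary', 'content'],
                }
            },
            **filters,  # adds the 'filter' key next to 'must'
        }
    }


filters = {'filter': [{'term': {'category.keyword': {'value': 'sharepoint'}}}]}
body = build_query('work from home', filters)
print(list(body['bool'].keys()))  # ['must', 'filter']
```

When the user does not enter a filter, the filter key holds an empty list, so the same code path works with or without filters.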
Range Filters
Elasticsearch supports a variety of filters besides the term filter. Another one that is commonly used is the range filter, which works with numbers and dates. Let's add a year filter that can be used to restrict results based on the year they were last updated, which is given in the updated_at field.
Below is an updated version of the extract_filters() function that looks for both category:<category> and year:<yyyy> as filters:
def extract_filters(query):
    filters = []

    filter_regex = r'category:([^\s]+)\s*'
    m = re.search(filter_regex, query)
    if m:
        filters.append({
            'term': {
                'category.keyword': {
                    'value': m.group(1)
                }
            },
        })
        query = re.sub(filter_regex, '', query).strip()

    filter_regex = r'year:([^\s]+)\s*'
    m = re.search(filter_regex, query)
    if m:
        filters.append({
            'range': {
                'updated_at': {
                    'gte': f'{m.group(1)}||/y',
                    'lte': f'{m.group(1)}||/y',
                }
            },
        })
        query = re.sub(filter_regex, '', query).strip()

    return {'filter': filters}, query
This version adds a second regular expression to find year:yyyy in the query string. It creates a range filter for the updated_at field, and sets the low and high bounds of the range to the year that is given after the colon, which is captured in the regular expression match as m.group(1).
There is a small complication, because the updated_at field contains full dates, and this filter only needs to look at the year. Luckily, when the range filter is used with a date field, the bounds of the range can be enhanced with date math. The ||/y suffix that is added to the gte (lower bound) and lte (upper bound) parameters of the range tells Elasticsearch to round the given value to the year: the lower bound is rounded down to the start of that year and the upper bound is rounded up to its end, which produces full dates that can be compared against the field.
With this change, you can include a query such as year:2020 work from home to see results from the requested year only. The query can include the two filters as well, for example year:2020 category:teams work from home.
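As a point of comparison, the same one-year window can be written without date math by spelling out the bounds. The sketch below is an alternative formulation and not what the tutorial application uses. It assumes that updated_at is mapped as a date field that accepts yyyy-MM-dd strings, and explicit_year_filter() is a hypothetical helper.

```python
def explicit_year_filter(year):
    # hypothetical helper: half-open range [Jan 1 of year, Jan 1 of next year)
    year = int(year)  # raises ValueError if the text is not a number
    return {
        'range': {
            'updated_at': {
                'gte': f'{year:04d}-01-01',
                'lt': f'{year + 1:04d}-01-01',
            }
        }
    }


print(explicit_year_filter('2020'))
```

The date math version is shorter because Elasticsearch does the rounding, while the explicit version has the side effect of rejecting values that are not numbers before the query is sent.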
The Match-All Query
Before moving on to a new topic, try entering only a filter in the search query text field, for example category:github. Unfortunately this does not return any results, but the expected behavior in this case would be to receive all the results that match the requested category.
What happens is that the extract_filters() function returns a tuple with the filter(s) in the first element and an empty query string in the second element. The multi_match query receives the empty string, and returns an empty list of results, because nothing matches an empty string.
To address this special case, the multi_match query can be replaced with match_all when the search text is empty. The version of the handle_search() function below adds logic to do this. Update the function in app.py.
@app.post('/')
def handle_search():
    query = request.form.get('query', '')
    filters, parsed_query = extract_filters(query)
    from_ = request.form.get('from_', type=int, default=0)

    if parsed_query:
        search_query = {
            'must': {
                'multi_match': {
                    'query': parsed_query,
                    'fields': ['name', 'summary', 'content'],
                }
            }
        }
    else:
        search_query = {
            'must': {
                'match_all': {}
            }
        }

    results = es.search(
        query={
            'bool': {
                **search_query,
                **filters
            }
        },
        size=5,
        from_=from_
    )
    return render_template('index.html', results=results['hits']['hits'],
                           query=query, from_=from_,
                           total=results['hits']['total']['value'])
With this version, you can ask for all the documents that match a category. Note how all the results that are returned come back with the same score of 1.0, because there are no search terms to compute scores.
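The choice between the two base queries can also be checked in isolation, without Flask or Elasticsearch. The sketch below moves that decision into a small function. Here base_query() is a hypothetical helper that mirrors the if/else block in handle_search() and is not part of the tutorial application.

```python
def base_query(parsed_query):
    # hypothetical helper: full-text search when there is text,
    # match_all when the search text is empty
    if parsed_query:
        return {
            'must': {
                'multi_match': {
                    'query': parsed_query,
                    'fields': ['name', 'summary', 'content'],
                }
            }
        }
    return {'must': {'match_all': {}}}


print(base_query(''))
print(base_query('work from home'))
```

An empty string is falsy in Python, which is why a plain truth test on parsed_query is enough to detect the filter-only case.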