Sparse vector queries take advantage of Elasticsearch’s powerful inference API, allowing easy built-in setup for Elastic-hosted models such as ELSER and E5, as well as the flexibility to host other models.
Introduction
Vector search is evolving, and as our needs for vector search evolve, so does the need for a consistent, forward-thinking vector search API.
When Elastic first launched semantic search, we leveraged existing rank_features fields using the text_expansion query. We then reintroduced the sparse_vector field type for semantic search use cases.
Looking ahead at where sparse vector search is going, we’ve introduced a new sparse vector query. As of Elasticsearch 8.15.0, both the text_expansion query and the weighted_tokens query have been deprecated in favor of the new sparse vector query.
The sparse vector query supports two modes of querying: using an inference ID and using precomputed query vectors. Both modes of querying require data to be indexed in a sparse_vector mapped field.
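A sparse_vector field stores token-weight pairs that an inference model generates at index time. These token-weight pairs are then used in a query against the sparse vector field. At query time, query vectors are calculated using the same inference model that was used to create the tokens.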
Let’s look at an example: let’s say we’ve indexed a document detailing when Orion is most visible in the night sky:
Now, assume we’re looking for constellations that are visible in the northern hemisphere, and we run this query through the same learned sparse encoder model. The output might look similar to this:
At query time, these vectors are ORed together, and scoring is effectively a dot product calculation between the stored dimensions and the query dimensions, which would score this example at 10.84:
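To make the dot product idea concrete, here is a minimal Python sketch. The token weights are invented for illustration; they are not actual ELSER output for the Orion example, so the result does not match the score quoted above. The function name `sparse_dot_product` is also hypothetical.

```python
def sparse_dot_product(doc_vector, query_vector):
    """Sum the weight products over tokens present in both sparse vectors."""
    shared_tokens = doc_vector.keys() & query_vector.keys()
    return sum(doc_vector[t] * query_vector[t] for t in shared_tokens)


# Hypothetical token weights, invented for illustration only.
doc_vector = {"orion": 2.0, "constellation": 1.5, "winter": 1.0, "sky": 0.5}
query_vector = {"constellation": 2.0, "northern": 1.5, "sky": 1.0, "orion": 0.5}

score = sparse_dot_product(doc_vector, query_vector)
print(score)  # 4.5 = (2.0 * 0.5) + (1.5 * 2.0) + (0.5 * 1.0)
```

Tokens that appear in only one of the two vectors ("winter" and "northern" here) contribute nothing to the score, which is why the query can be treated as an OR over its tokens.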
Sparse vector queries with inference
Sparse vector queries using inference work in a very similar way to the previous text expansion query. The difference is that instead of sending in a trained model, we create an inference endpoint associated with the model we want to use.
Here’s an example of how to create an inference endpoint for ELSER:
PUT _inference/sparse_embedding/my-elser-endpoint
{
  "service": "elser",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  }
}
You should use an inference endpoint to index your sparse vector data, and use the same endpoint as input to your sparse_vector query. For example:
POST my-index/_search
{
  "query": {
    "sparse_vector": {
      "field": "embeddings",
      "inference_id": "my-elser-endpoint",
      "query": "constellations in the northern hemisphere"
    }
  }
}
Sparse vector queries with precomputed query vectors
You may have precomputed vectors that don’t require inference at query time. These can be sent into the sparse_vector query instead of using inference. Here is an example:
POST my-index/_search
{
  "query": {
    "sparse_vector": {
      "field": "embeddings",
      "query_vector": {
        "constellation": 2.5,
        "northern": 1.9,
        "hemisphere": 1.8,
        "orion": 1.5,
        "galaxy": 1.4,
        "astronomy": 0.9,
        "telescope": 0.3,
        "star": 0.01
      }
    }
  }
}
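If you build these requests in application code, the two modes can be wrapped in one small helper. The Python sketch below only constructs the request body; the function name `build_sparse_vector_query` is hypothetical, and it assumes the two modes are mutually exclusive, as the descriptions above suggest. Sending the body to Elasticsearch is left to whichever client you use.

```python
def build_sparse_vector_query(field, inference_id=None, query=None, query_vector=None):
    """Build a sparse_vector query clause for exactly one of the two modes."""
    using_inference = inference_id is not None or query is not None
    if using_inference and query_vector is not None:
        # Assumption: inference and precomputed vectors cannot be combined.
        raise ValueError("use either inference_id/query or query_vector, not both")
    if query_vector is not None:
        clause = {"field": field, "query_vector": dict(query_vector)}
    elif inference_id is not None and query is not None:
        clause = {"field": field, "inference_id": inference_id, "query": query}
    else:
        raise ValueError("provide inference_id and query, or query_vector")
    return {"sparse_vector": clause}


# Mode 1: let the inference endpoint compute the query vector.
inference_body = {
    "query": build_sparse_vector_query(
        "embeddings",
        inference_id="my-elser-endpoint",
        query="constellations in the northern hemisphere",
    )
}

# Mode 2: send precomputed token weights.
precomputed_body = {
    "query": build_sparse_vector_query(
        "embeddings",
        query_vector={"constellation": 2.5, "northern": 1.9, "hemisphere": 1.8},
    )
}
```

Both bodies have the same shape as the console requests shown above, so they can be compared against those examples directly.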
Query optimization with token pruning
Like text expansion search, the sparse vector query is subject to performance penalties from huge boolean queries. Therefore, the same token pruning strategies available for text expansion queries are available in the sparse vector query. You can see the impact of token pruning in our nightly MS Marco Passage Ranking benchmarks.
In order to enable pruning with the default pruning configuration (which has been tuned for ELSER V2), simply add prune: true to your request:
POST my-index/_search
{
  "query": {
    "sparse_vector": {
      "field": "embeddings",
      "inference_id": "my-elser-endpoint",
      "query": "constellations in the northern hemisphere",
      "prune": true
    }
  }
}
Alternatively, you can adjust the pruning configuration by sending it directly with the request:
GET my-index/_search
{
  "query": {
    "sparse_vector": {
      "field": "embeddings",
      "inference_id": "my-elser-endpoint",
      "query": "constellations in the northern hemisphere",
      "prune": true,
      "pruning_config": {
        "tokens_freq_ratio_threshold": 5,
        "tokens_weight_threshold": 0.4,
        "only_score_pruned_tokens": false
      }
    }
  }
}
Because token pruning will incur a recall penalty, we recommend adding the pruned tokens back in a rescore:
GET my-index/_search
{
  "query": {
    "sparse_vector": {
      "field": "embeddings",
      "inference_id": "my-elser-endpoint",
      "query": "constellations in the northern hemisphere",
      "prune": true,
      "pruning_config": {
        "tokens_freq_ratio_threshold": 5,
        "tokens_weight_threshold": 0.4,
        "only_score_pruned_tokens": false
      }
    }
  },
  "rescore": {
    "window_size": 100,
    "query": {
      "rescore_query": {
        "sparse_vector": {
          "field": "embeddings",
          "inference_id": "my-elser-endpoint",
          "query": "constellations in the northern hemisphere",
          "prune": true,
          "pruning_config": {
            "tokens_freq_ratio_threshold": 5,
            "tokens_weight_threshold": 0.4,
            "only_score_pruned_tokens": true
          }
        }
      }
    }
  }
}
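The main query and the rescore query in this request differ only in the only_score_pruned_tokens flag, so it is easy for the two copies to drift apart when edited by hand. As a rough sketch, a hypothetical Python helper (`with_pruned_token_rescore` is not part of any Elastic client) could derive both from a single clause:

```python
import copy


def with_pruned_token_rescore(sparse_vector_clause, window_size=100):
    """Build a search body that prunes in the main query and
    rescores the top hits using only the pruned tokens."""
    # Main query: pruning on, score with the tokens that survive pruning.
    primary = copy.deepcopy(sparse_vector_clause)
    primary["prune"] = True
    primary.setdefault("pruning_config", {})["only_score_pruned_tokens"] = False

    # Rescore query: identical, except it scores only the pruned tokens.
    rescore = copy.deepcopy(primary)
    rescore["pruning_config"]["only_score_pruned_tokens"] = True

    return {
        "query": {"sparse_vector": primary},
        "rescore": {
            "window_size": window_size,
            "query": {"rescore_query": {"sparse_vector": rescore}},
        },
    }


clause = {
    "field": "embeddings",
    "inference_id": "my-elser-endpoint",
    "query": "constellations in the northern hemisphere",
    "pruning_config": {
        "tokens_freq_ratio_threshold": 5,
        "tokens_weight_threshold": 0.4,
    },
}
body = with_pruned_token_rescore(clause)
```

The resulting `body` mirrors the console request above. The input clause is deep-copied, so it can be reused for other requests without carrying over the pruning flags.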
What's next?
While the text_expansion query is GA’d and will be supported throughout Elasticsearch 8.x, we recommend updating to the sparse_vector query as soon as possible in order to ensure you’re using the most up-to-date features as we continually improve the vector search experience in Elasticsearch.
If you are using the weighted_tokens query, this was never GA’d and will be replaced by the sparse_vector query very soon.
The sparse_vector query will be available starting with 8.15.0 and is already available in Serverless - try it out today!