Learn about the trade-offs of using semantic reranking in search and RAG pipelines.

Retrieval

Semantic Reranking

Cross-encoders

Connection with RAG

Wrapping Up

Semantic search reranking: What it is and how to use it

What is semantic reranking and how to use it?

Understand what binary quantization is, how it works and its benefits. This guide also covers the math behind the quantization and examples.

Introduction

Building the Bit Vectors

Find a Representative Centroid

1 Bit, 1 Bit Only Please

The Catch

The Query

Estimated Distance

Re-ranking

Conclusion

Quantization encodes vectors in a lossy manner, trading a slight loss of fidelity for large space savings.

What are the benefits of binary quantization?

Binary quantization transforms each vector relative to a prototypical set of sample vectors, creating a smaller representation that can be used to approximate distances with large savings in space and computation.

How does binary quantization work?
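As a rough illustration of the idea, the sketch below computes a single centroid, keeps one bit per dimension relative to that centroid, shortlists candidates with a cheap bit comparison, and then reranks the shortlist with exact float distances. It is a simplified toy under stated assumptions, not the RaBitQ estimator and not Elasticsearch's implementation: the function names, the Hamming-distance shortlist, the 1-bit query encoding, and the `oversample` parameter are all hypothetical choices made for brevity.

```python
import math


def centroid(vectors):
    """Component-wise mean of the data vectors (one centroid for the whole set)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def to_bits(vector, center):
    """One bit per dimension: 1 if the component lies above the centroid's, else 0."""
    return [1 if v > c else 0 for v, c in zip(vector, center)]


def hamming(a, b):
    """Number of positions where two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def search_sketch(query, vectors, k, oversample=3):
    """Shortlist by bit distance, then rerank the shortlist with exact distances.

    Assumption: the query is reduced to bits with the same centroid as the data.
    A real system may keep more precision on the query side.
    """
    center = centroid(vectors)
    codes = [to_bits(v, center) for v in vectors]
    query_code = to_bits(query, center)

    # Coarse pass: cheap comparison on the 1-bit codes, keeping extra candidates.
    by_bits = sorted(range(len(vectors)), key=lambda i: hamming(query_code, codes[i]))
    candidates = by_bits[: k * oversample]

    # Rerank pass: exact Euclidean distance on the original float vectors.
    return sorted(candidates, key=lambda i: math.dist(query, vectors[i]))[:k]


if __name__ == "__main__":
    data = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.1], [0.1, -0.1]]
    print(search_sketch([1.0, 1.0], data, k=1))
```

Keeping more candidates than `k` in the coarse pass is what gives the rerank step room to recover neighbors that the lossy bit codes ranked poorly.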

Binary quantization does not really create codebooks; instead, it compresses each data vector down to a 1 bit representation that encodes its inclusion in a region around a centroid. The authors of the [RaBitQ paper](https://arxiv.org/pdf/2405.12497), which is a large driver of our understanding and implementation of binary quantization, mention how the codebooks in PQ are related to the regions around any given centroid, so if it is more intuitive to think of these regions as codebooks, then yes.

Does binary quantization create codebooks similar to PQ?

Binary quantization relies on finding regions first. Query vectors are similarly transformed into this space using the same centroid that was used to quantize the data vectors. This is what allows the distance estimation comparison to happen.

Are the query vectors converted to 1 bit prior to comparison with the codebook / regions created from the data vectors?

Yes, if the number of vectors justifies it. Even for millions of vectors, though, we have found that a single centroid is sufficient for many datasets and models.

Are there multiple centroids?

No. Since very few centroids are required, they can be maintained in their float32 form without concern for scaling.

Are the centroids converted to a 1 bit representation as well?

We've found so far that we can reduce complexity, both algorithmically and performance-wise, by removing this random transformation. While some vectors may be favored, reranking a reasonable set of data, which is required to achieve reasonable recall anyway, seems to negate the need for this in practice.

The authors of the [RaBitQ paper](https://arxiv.org/pdf/2405.12497) mention transforming the data with a random orthogonal matrix P. Why do you not include it here?

We've found so far that a single centroid works well for most production datasets. Additionally, when talking about how binary quantization works, we can focus on a single centroid at a time. As you have more vectors (more than 100 million), you may find that recall is impacted. When using Elasticsearch, these large datasets can be pre-partitioned, and multiple centroids will naturally be used.

The authors of the [RaBitQ paper](https://arxiv.org/pdf/2405.12497) mention using multiple centroids, but you focus on a single centroid here. Why only one centroid?

Better Binary Quantization 101: An Introduction

Better Binary Quantization 101

Learn about how Elastic's new re-ranker model was trained and how it performs

How does it compare?

Architecture

Data sets and training

Summary

Elastic Rerank. Semantic re-ranker model

Introducing Elastic Rerank: Elastic's new semantic re-ranker model

Using the Phi-3 language model as a relevance judge, with tips & techniques to improve the agreement with human-generated annotation

A Short Introduction to Phi Models

A note on terminology

Our experiment

About the annotation task

About the language model

Results Summary

The role of word choice on prompt performance

Tone words

Output instructions

"Take a step back" reasoning

Role playing

More precise instructions

Pairwise

Few-shot chain-of-thought

Ensemble

Microsoft Phi-3 model for evaluating search relevance

Evaluating search relevance part 2 - Phi-3 as relevance judge

Learn to evaluate your search system in the context of better understanding the BEIR benchmark, with tips & techniques to improve your search evaluation processes.

Understanding the BEIR benchmark in search relevance evaluation

Structure of a BEIR dataset

Leveraging the BEIR benchmark for search relevance evaluation

Main takeaways & next steps

The BEIR benchmark & Elasticsearch search relevance evaluation

Evaluating search relevance part 1 - The BEIR benchmark

Learn how scalar quantization can be used to reduce the memory footprint of vector embeddings in Elasticsearch through an experiment.

Understanding scalar quantization in Elasticsearch

Experimentation: Evaluating scalar quantization

Overview of methodology

Results

The benefits of using scalar quantization in Elasticsearch include reducing the memory footprint of vector embeddings without significantly affecting retrieval performance.

What are the benefits of using scalar quantization in Elasticsearch?

Evaluating scalar quantization in Elasticsearch

Optimizing scalar quantization for the vector database use case allows us to achieve significantly better performance for the same retrieval quality at high compression ratios.

Scalar quantization recap

Novelties introduced to scalar quantization

1. Error correcting the scalar dot product

2. Optimizing the truncation interval

Proof of principle for int4 quantization

Scalar quantization optimized for vector databases

This blog explains how int4 quantization works in Lucene, how it lines up, and the benefits of using int4 quantization.

Introduction to Int4 quantization in Lucene

How does `Int4` quantization work in Lucene

Storing and scoring the quantized vectors

Calculating the quantization error correction

Finding the optimal bucketing for int4 quantization

Speed vs. size for quantization

Speed part 2: more SIMD in int4

The end?

Int4 provides additional compression options. It reduces the quantization space to only 16 possible values (0 through 15).

What is Int4 quantization in Lucene?
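To make the "16 possible values" concrete, here is a minimal sketch of scalar quantization to 4 bits. It is not Lucene's code: the function names and the `lo`/`hi` clipping interval are hypothetical, and choosing that interval well is a separate problem this sketch does not address. It only shows components being clamped to an interval, mapped to integers 0 through 15, and packed two per byte.

```python
def quantize_int4(vector, lo, hi):
    """Clamp each component to [lo, hi] and map it to an integer bucket in 0..15."""
    step = (hi - lo) / 15  # 16 buckets means 15 steps between lo and hi
    return [round((min(max(v, lo), hi) - lo) / step) for v in vector]


def dequantize_int4(codes, lo, hi):
    """Map bucket indices back to approximate float values (lossy)."""
    step = (hi - lo) / 15
    return [lo + c * step for c in codes]


def pack_nibbles(codes):
    """Pack two 4-bit codes into each byte, padding with a zero code if needed."""
    if len(codes) % 2:
        codes = codes + [0]
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))


if __name__ == "__main__":
    codes = quantize_int4([-3.0, 0.0, 7.2, 15.0, 99.0], lo=0.0, hi=15.0)
    print(codes)                      # values outside [lo, hi] are clamped
    print(pack_nibbles(codes).hex())  # two codes per byte
```

Because every code fits in 4 bits, a packed vector needs half a byte per dimension, compared with four bytes per dimension for float32.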

Understanding Int4 scalar quantization in Lucene

Explore RAG evaluation metrics like BLEU score, ROUGE score, PPL, BARTScore, and more. Discover how Elastic is evaluating RAG with UniEval.

N-gram metrics

BLEU score

ROUGE score

METEOR score

Intrinsic metrics

Perplexity (PPL)

Model-based metrics

BERTScore

BLEURT

BARTScore

UniEval: Elastic’s choice for evaluating RAG

Real-world usage of UniEval

Various metrics are used to evaluate RAG, including n-gram metrics (BLEU, ROUGE, and METEOR scores), intrinsic metrics (such as PPL), model-based metrics (such as BERTScore, BLEURT, and BARTScore), and Elastic's choice, UniEval.

What metrics are commonly used to evaluate RAG?

UniEval evaluates RAG by unifying all evaluation dimensions into a Boolean Question Answering framework, allowing a single model to assess a generated text from various angles.

ML Research

Exploring depth in a 'retrieve-and-rerank' pipeline