Scalar quantization 101

Most embedding models output float32 vector values. While this provides the highest fidelity, it is wasteful given the information that is actually important in the vector. Within a given data set, embeddings never require all of float32's roughly 4 billion possible values for each individual dimension. This is especially true for higher dimensional vectors (e.g. 386 dimensions and higher). Quantization allows vectors to be encoded in a lossy manner, trading a slight reduction in fidelity for huge space savings.

Buckets of fun

Scalar quantization takes each vector dimension and buckets it into some smaller data type. For the rest of the blog, we will assume quantizing float32 values into int8. Bucketing values accurately isn't as simple as rounding the floating point values to the nearest integer. Many models output vectors whose dimensions lie continuously in the range $[-1.0, 1.0]$. So, two different vector values, 0.123 and 0.321, would both be rounded to 0. Ultimately, a vector would only use 2 of its 256 available int8 buckets, losing too much information.

Quantization illustration

Figure 1: Illustration of quantization goals, bucketing continuous values in the range $[-1.0, 1.0]$ into discrete int8 values.

The math behind the numerical transformation isn't too complicated. Since we can calculate the minimum and maximum values for the floating point range, we can use min-max normalization and then linearly shift the values.

$$int8 \approx \frac{127}{max - min} \times (float32 - min)$$
$$float32 \approx \frac{max - min}{127} \times int8 + min$$

Figure 2: Equations for transforming between int8 and float32. Note, these transformations are lossy and not exact. In the following examples, we only use positive int8 values. This aligns with the Lucene implementation.
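The transformations in Figure 2 can be sketched in a few lines of Python. This is a minimal illustration, not Lucene's implementation; the function names are ours, and values are bucketed into $[0, 127]$ to match the positive-only int8 range noted above.

```python
def quantize(value, min_val, max_val):
    """Map a float in [min_val, max_val] to an int bucket in [0, 127]."""
    scaled = 127.0 / (max_val - min_val) * (value - min_val)
    return int(round(scaled))

def dequantize(bucket, max_val=1.0, min_val=-1.0):
    """Approximately recover the original float from its bucket (lossy)."""
    return (max_val - min_val) / 127.0 * bucket + min_val

# Nearby values now land in distinct buckets instead of collapsing to 0:
b1 = quantize(0.123, -1.0, 1.0)  # bucket 71
b2 = quantize(0.321, -1.0, 1.0)  # bucket 84
```

Note the round trip is not exact: dequantizing a bucket recovers the original value only to within half a bucket width, here $(max - min)/127/2 \approx 0.0079$.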

Quantiles of fun

A quantile is a slice of a distribution that contains a certain percentage of the values. So, for example, it may be that $99\%$ of our floating point values fall between $[-0.75, 0.86]$ instead of the true minimum and maximum of $[-1.0, 1.0]$. Any values less than -0.75 or greater than 0.86 are considered outliers. If you include outliers when attempting to quantize results, you will have fewer available buckets for your most common values. And fewer buckets can mean less accuracy and thus greater loss of information.
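Choosing bounds from quantiles rather than the true min/max can be sketched as below. This is an illustrative approach (simple order-statistic quantiles over a sorted list, with a symmetric tail on each side), not the exact method Lucene uses:

```python
def quantile_bounds(values, confidence=0.99):
    """Return (lower, upper) bounds covering `confidence` of the values,
    trimming an equal tail of outliers from each side."""
    s = sorted(values)
    tail = (1.0 - confidence) / 2.0
    lo_idx = round(tail * (len(s) - 1))
    hi_idx = round((1.0 - tail) * (len(s) - 1))
    return s[lo_idx], s[hi_idx]

def clip(value, lo, hi):
    """Clamp outliers into the chosen range before quantizing."""
    return max(lo, min(hi, value))
```

Values outside the quantile bounds are clipped to the nearest bound, so the full 0-127 bucket range is spent on the values that actually occur most often.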

Quantile illustration

Figure 3: Illustration of the $99\%$ confidence interval and the individual quantile values. $99\%$ of all values fall within the range $[-0.75, 0.86]$.

This is all well and good, but now that we know how to quantize values, how can we actually calculate distances between two quantized vectors? Is it as simple as a regular dot_product?

Time to remember your Algebra

We are still missing one vital piece: how do we calculate the distance between two quantized vectors? While we haven't shied away from math in this blog, we are about to do a bunch more. Time to break out your pencils and remember polynomials and basic algebra.

The basic requirement for dot_product and cosine similarity is being able to multiply floating point values together and sum up their results. We already know how to transform between float32 and int8 values, so what does multiplication look like with our transformations?

$$float32_i \times float32'_i \approx \left(\frac{max - min}{127} \times int8_i + min\right) \times \left(\frac{max - min}{127} \times int8'_i + min\right)$$

We can then expand this multiplication; to simplify, we will substitute $\alpha$ for $\frac{max - min}{127}$.

$$\alpha^2 \times int8_i \times int8'_i + \alpha \times int8_i \times min + \alpha \times int8'_i \times min + min^2$$

What makes this even more interesting is that only one part of this equation requires both vectors at the same time. However, dot_product isn't just two floats being multiplied, but all the floats for each dimension of the vector. With the vector dimension count $dim$ in hand, all of the following can be pre-calculated at storage time and query time:

$\alpha^2$ is just $\left(\frac{max - min}{127}\right)^2$ and can be stored as a single float value.

$\sum_{i=0}^{dim-1} min \times \alpha \times int8_i$ and $\sum_{i=0}^{dim-1} min \times \alpha \times int8'_i$ can be pre-calculated and stored as single float values, or calculated once at query time.

$dim \times min^2$ can be pre-calculated and stored as a single float value.

Putting it all together:

$$\alpha^2 \times dotProduct(int8, int8') + \sum_{i=0}^{dim-1} min \times \alpha \times int8_i + \sum_{i=0}^{dim-1} min \times \alpha \times int8'_i + dim \times min^2$$

The only calculation required at query time is $dotProduct(int8, int8')$; the pre-calculated values are then combined with its result.
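The whole derivation can be checked numerically with a short sketch. The function name and setup here are illustrative, not any library's API; the key point is that the integer dot product is the only per-comparison work, while $\alpha$, the per-vector sums, and $dim \times min^2$ could all be pre-computed.

```python
import random

def corrected_dot_product(q1, q2, min_val, max_val):
    """Approximate the float dot product from two int8-bucketed vectors."""
    alpha = (max_val - min_val) / 127.0
    dim = len(q1)
    int_dot = sum(a * b for a, b in zip(q1, q2))  # only pairwise work needed
    corr1 = min_val * alpha * sum(q1)             # pre-computable per vector
    corr2 = min_val * alpha * sum(q2)             # pre-computable per vector
    return alpha * alpha * int_dot + corr1 + corr2 + dim * min_val * min_val

random.seed(42)
dim, lo, hi = 128, -1.0, 1.0
v1 = [random.uniform(lo, hi) for _ in range(dim)]
v2 = [random.uniform(lo, hi) for _ in range(dim)]
# Quantize each dimension into buckets 0..127 per the transformation above.
q1 = [round(127.0 / (hi - lo) * (x - lo)) for x in v1]
q2 = [round(127.0 / (hi - lo) * (x - lo)) for x in v2]

true_dot = sum(a * b for a, b in zip(v1, v2))
approx_dot = corrected_dot_product(q1, q2, lo, hi)
# approx_dot tracks true_dot closely despite 8-bit storage
```

In this sketch the corrected score lands within a small fraction of the true float32 dot product, even though each dimension was stored in a single byte.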

But, how is this accurate?

So, how is this accurate at all? Aren't we losing information by quantizing? Yes, we are, but quantization takes advantage of the fact that we don't need all the information. For learned embedding models, the distributions of the various dimensions usually don't have fat tails. This means they are localized and fairly consistent. Additionally, the error introduced per dimension via quantization is independent, meaning it tends to cancel out in our typical vector operations like dot_product.
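A tiny experiment makes the cancellation concrete. With a fixed seed and uniformly distributed values (an assumption for illustration; real embedding dimensions have their own distributions), the signed sum of per-dimension quantization errors is far smaller than the sum of their magnitudes, which is what the error would be if it all pushed in one direction:

```python
import random

random.seed(7)
alpha = 2.0 / 127.0  # bucket width for the range [-1.0, 1.0]
values = [random.uniform(-1.0, 1.0) for _ in range(1024)]
buckets = [round((v + 1.0) / alpha) for v in values]
# Per-dimension error between the dequantized and original value.
errors = [(alpha * b - 1.0) - v for b, v in zip(buckets, values)]

signed_sum = abs(sum(errors))                 # errors with signs: mostly cancel
magnitude_sum = sum(abs(e) for e in errors)   # worst case if they all aligned
```

Because each error is independent and roughly symmetric around zero, the signed sum grows like the square root of the dimension count while the worst case grows linearly, so the gap widens as vectors get larger.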


Whew, that was a ton to cover. But now you have a good grasp of the technical benefits of quantization, the math behind it, and how you can calculate the distances between vectors while accounting for the linear transformation. Look next at how we implemented this in Lucene and some of the unique challenges and benefits available there.
