Generate Embeddings
In this section you will learn about one of the most convenient ways to generate embeddings for text: the SentenceTransformers framework.
Working with SentenceTransformers is the recommended path while you explore and become familiar with embeddings, because the models available under this framework can be installed on your computer, perform reasonably well without a GPU, and are free to use.
Install SentenceTransformers
The SentenceTransformers framework is installed as a Python package. Make sure that your Python virtual environment is activated, and then run the following command in your terminal:
pip install sentence-transformers
As always, any time you add new dependencies to your project it is a good idea to update your requirements file:
pip freeze > requirements.txt
Selecting a Model
The next task is to decide on a machine learning model to use for embedding generation. There is a list of pretrained models in the documentation. Because SentenceTransformers is a very popular framework, there are also compatible models created by researchers not directly associated with the framework. To see a complete list of models that can be used, you can check the SentenceTransformers tag on HuggingFace.
For the purposes of this tutorial there is no need to overthink the model selection, as any general-purpose model will suffice. The SentenceTransformers documentation includes the following note regarding its pretrained models:
The all-* models were trained on all available training data (more than 1 billion training pairs) and are designed as general purpose models. The all-mpnet-base-v2 model provides the best quality, while all-MiniLM-L6-v2 is 5 times faster and still offers good quality.
This suggests that all-MiniLM-L6-v2 offers a good compromise between speed and quality, so let's use this model. Locate this model in the table, and click the "info" icon to see some information about it.
An important detail to know about your chosen model is the length of the generated embeddings, or in other words, how many numbers or dimensions the resulting vectors have. This matters because it directly affects the amount of storage you will need. In the case of all-MiniLM-L6-v2, the generated vectors have 384 dimensions.
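To get a feel for how the number of dimensions translates into storage, here is a rough back-of-the-envelope sketch. The function name is made up for this illustration, and the figure of 4 bytes per number is an assumption (32-bit floats); the real footprint depends on how and where you store the vectors, and any database or index overhead is ignored.

```python
def embedding_storage_bytes(num_texts, dimensions=384, bytes_per_value=4):
    """Estimate the raw size of num_texts embeddings.

    Assumes each number is stored as a 32-bit float (4 bytes) and
    ignores any overhead added by a database or index.
    """
    return num_texts * dimensions * bytes_per_value


per_embedding = embedding_storage_bytes(1)          # 384 * 4 = 1536 bytes
per_million = embedding_storage_bytes(1_000_000)    # 1,536,000,000 bytes

print(f"{per_embedding} bytes per embedding")
print(f"{per_million / 1024**2:.0f} MiB per million embeddings")
```

A model with more dimensions increases this estimate proportionally, which is one reason a smaller model can be attractive for experimentation.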
Loading the Model
The following Python code demonstrates how the model is loaded. You can try this in a Python shell.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
The first time you do this, the model is downloaded and stored in a local cache on your computer, so the call may take some time to return. Once the model is cached, instantiating it should not take long.
Generating Embeddings
With the model instantiated, you are now ready to generate an embedding. To do this, pass the source text to the model.encode() method:
embedding = model.encode('The quick brown fox jumps over the lazy dog')
The result is an array with all the numbers that make up the embedding. As you recall, the embeddings generated by the chosen model have 384 dimensions, so this is the length of the embedding array.
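The encode() method also accepts a list of strings and returns one embedding per string, which is handy when you have several texts to process. The helper below is a minimal sketch of that pattern; embed_texts is a hypothetical name introduced for this example and is not part of SentenceTransformers. It works with any object that exposes a compatible encode() method.

```python
def embed_texts(model, texts):
    """Encode several texts in one call and pair each text with its embedding.

    `model` is assumed to be any object with a SentenceTransformer-style
    encode() method that takes a list of strings and returns one vector
    per string, in the same order.
    """
    vectors = model.encode(texts)
    # Convert each vector to a plain list of Python floats for easy inspection.
    return [(text, [float(x) for x in vector]) for text, vector in zip(texts, vectors)]


# Usage with the model from this section (downloads the model on first use):
#
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer('all-MiniLM-L6-v2')
#   pairs = embed_texts(model, ['The quick brown fox', 'jumps over the lazy dog'])
#   for text, vector in pairs:
#       print(text, len(vector))
```

With all-MiniLM-L6-v2, each vector in the result should have the 384 dimensions mentioned earlier.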