Ingest
In this application, the ingest of all the example documents is triggered with the flask create-index command. The command is implemented in the app.py file in the api directory; it imports the index_data.py module from the data directory and calls its main() function, which performs a complete import of all the documents stored in the data.json file.
Document Structure
The structure of each document is as follows:
name: the document title
url: a URL to the document hosted on an external site
summary: a short summary of the contents of the document
content: the body of the document
created_on: creation date
updated_at: update date (could be missing if the document was never updated)
category: the document's category, which can be github, sharepoint or teams
rolePermissions: a list of role permissions
From these, this example application uses the content field as the text to be indexed, and adds name, summary, url, category and updated_at as associated metadata.
The following snippet of Python code shows how the documents are imported:
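import json

from langchain.docstore.document import Document

# fields copied from each source record into the document metadata
metadata_keys = ['name', 'summary', 'url', 'category', 'updated_at']
workplace_docs = []
with open(FILE, 'rt') as f:  # FILE is the path to the data.json file
    for doc in json.loads(f.read()):
        workplace_docs.append(Document(
            page_content=doc['content'],
            metadata={k: doc.get(k) for k in metadata_keys}
        ))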
Here the json module from the Python standard library is used to read the data file, and a Document object from Langchain is created for each document it contains. Documents have a page_content attribute that defines the content to be converted into vectors and searched, plus a number of additional fields that are stored as metadata. The metadata_keys list determines which fields from the source content are stored as document metadata.
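As a minimal sketch of the field selection described above, the following standard-library example applies the same dictionary comprehension to a single record. The record is hypothetical (its values are invented for illustration and are not taken from data.json), and plain dictionaries stand in for Langchain Document objects so that the example runs without extra dependencies.

```python
import json

metadata_keys = ['name', 'summary', 'url', 'category', 'updated_at']

# Hypothetical record in the documented shape. All values are invented,
# and "updated_at" is omitted on purpose (the document was never updated).
raw = json.dumps([{
    "name": "Example Policy",
    "url": "https://example.com/policy",
    "summary": "A short summary.",
    "content": "The full body of the document.",
    "created_on": "2020-01-01",
    "category": "sharepoint",
    "rolePermissions": ["demo"],
}])

records = []
for doc in json.loads(raw):
    records.append({
        # the text that would be embedded and searched
        "page_content": doc["content"],
        # doc.get(k) returns None for absent keys instead of raising KeyError
        "metadata": {k: doc.get(k) for k in metadata_keys},
    })

print(records[0]["metadata"]["updated_at"])          # prints None
print("rolePermissions" in records[0]["metadata"])   # prints False
```

Because doc.get() is used rather than doc[...], a record without updated_at still imports cleanly, and fields that are not listed in metadata_keys (such as created_on and rolePermissions) are left out of the metadata.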
Depending on your ingest needs, this method can be refined or changed. The Langchain project provides a large selection of document loaders that can be used depending on the format of the source content.
The Elastic Learned Sparse EncodeR (ELSER) Model
The Elasticsearch index used in this application is configured to automatically create sparse vector embeddings for all documents that are inserted. The install_elser() function in index_data.py makes sure that the ELSER model is installed and deployed on the Elasticsearch instance that you are using.
Text Splitting
The content field in these documents is long, which means that a single embedding will be unable to fully represent it. The standard solution when working with large amounts of text is to split the text into shorter passages, and then obtain embeddings for the individual passages, all of which are stored and indexed.
In this application, the RecursiveCharacterTextSplitter class from the Langchain library is used, paired with OpenAI's tiktoken encoder, which measures passage lengths in tokens, the same units that LLMs use.
Consider the following example, which demonstrates how text splitting works in the application:
>>> from langchain.docstore.document import Document
>>> from langchain.text_splitter import RecursiveCharacterTextSplitter
>>> doc = Document(page_content='the quick brown fox jumped over the lazy dog')
>>> text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=5, chunk_overlap=2)
>>> split_docs = text_splitter.transform_documents([doc])
>>> split_docs
[Document(page_content='the quick brown fox jumped'),
 Document(page_content='fox jumped over the lazy'),
 Document(page_content='the lazy dog')]
By setting the chunk_size argument of the text splitter, it is possible to control the length of the resulting passages. The chunk_overlap argument allows for some amount of overlap between passages, which often helps obtain better embeddings.
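To make the interaction between chunk_size and chunk_overlap concrete, here is a simplified sliding-window sketch in plain Python. It is not the algorithm used by RecursiveCharacterTextSplitter, which splits recursively on separators and counts tiktoken tokens; the split_with_overlap function is a hypothetical helper that treats each whitespace-separated word as one token.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size words that share chunk_overlap words.

    Simplifying assumption: one whitespace-separated word equals one token.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


chunks = split_with_overlap(
    "the quick brown fox jumped over the lazy dog",
    chunk_size=5,
    chunk_overlap=2,
)
for chunk in chunks:
    print(chunk)
# the quick brown fox jumped
# fox jumped over the lazy
# the lazy dog
```

Each window starts chunk_size - chunk_overlap words after the previous one, so the last two words of a passage are repeated at the start of the next. With these settings the sketch happens to produce the same passages as the interactive example above.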
In the actual application, the splitter is initialized with the following arguments:
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=512, chunk_overlap=256
)
You are welcome to change these values and see how changes affect the quality of the chatbot. Each time you change the splitter's configuration you should re-generate the index by running the flask create-index command.
Note that there are further considerations on text splitting in combination with the ELSER model that you should be aware of. For production use cases, you might have to choose a different tokenization method from the one used in this tutorial.
Document Store
Documents are stored in an Elasticsearch index. The name of the index is controlled by the ES_INDEX environment variable, which is defined in the .env file. By default the name of this index is workplace-app-docs.
The application uses the ElasticsearchStore class, which is part of the Elasticsearch integration in Langchain, and uses the official Elasticsearch client library for Python.
The complete logic that deals with the Elasticsearch index is shown below:
import os

from elasticsearch import Elasticsearch, NotFoundError
from langchain_elasticsearch import ElasticsearchStore

INDEX = os.getenv("ES_INDEX", "workplace-app-docs")
ELASTIC_CLOUD_ID = os.getenv("ELASTIC_CLOUD_ID")
ELASTICSEARCH_URL = os.getenv("ELASTICSEARCH_URL")
ELASTIC_API_KEY = os.getenv("ELASTIC_API_KEY")
ELSER_MODEL = os.getenv("ELSER_MODEL", ".elser_model_2")

# create an Elasticsearch client instance
if ELASTICSEARCH_URL:
    elasticsearch_client = Elasticsearch(
        hosts=[ELASTICSEARCH_URL],
    )
elif ELASTIC_CLOUD_ID:
    elasticsearch_client = Elasticsearch(
        cloud_id=ELASTIC_CLOUD_ID, api_key=ELASTIC_API_KEY
    )
else:
    raise ValueError(
        "Please provide either ELASTICSEARCH_URL or ELASTIC_CLOUD_ID and ELASTIC_API_KEY"
    )

# delete the existing index, if found
elasticsearch_client.indices.delete(index=INDEX, ignore_unavailable=True)

# write the documents stored in workplace_docs to the index
ElasticsearchStore.from_documents(
    workplace_docs,
    es_connection=elasticsearch_client,
    index_name=INDEX,
    strategy=ElasticsearchStore.SparseVectorRetrievalStrategy(model_id=ELSER_MODEL),
)
The ElasticsearchStore.from_documents() method imports all the Document instances stored in workplace_docs, writing them to the index given in the index_name argument. All operations are performed through the client given in the es_connection argument.
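The client-selection branch in the listing above gives ELASTICSEARCH_URL precedence over ELASTIC_CLOUD_ID when both are set. One way to make that precedence explicit and easy to check is to isolate it in a small function. The sketch below is only an illustration: connection_kwargs is a hypothetical helper that is not part of the application, and it returns keyword arguments rather than creating a client, so it runs without an Elasticsearch instance.

```python
def connection_kwargs(env):
    """Return the keyword arguments for Elasticsearch(), chosen from env.

    Hypothetical helper mirroring the if/elif/else branch in the listing
    above; env is any mapping, such as os.environ.
    """
    url = env.get("ELASTICSEARCH_URL")
    cloud_id = env.get("ELASTIC_CLOUD_ID")
    if url:
        # a direct URL wins, even if a cloud ID is also present
        return {"hosts": [url]}
    if cloud_id:
        return {"cloud_id": cloud_id, "api_key": env.get("ELASTIC_API_KEY")}
    raise ValueError(
        "Please provide either ELASTICSEARCH_URL or "
        "ELASTIC_CLOUD_ID and ELASTIC_API_KEY"
    )


# Both variables set: the URL takes precedence.
print(connection_kwargs({
    "ELASTICSEARCH_URL": "http://localhost:9200",
    "ELASTIC_CLOUD_ID": "my-deployment-id",
}))
# {'hosts': ['http://localhost:9200']}
```

With such a helper, the client could be created as Elasticsearch(**connection_kwargs(os.environ)).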
The strategy argument defines the way this index is going to be used. For this application, the SparseVectorRetrievalStrategy class indicates that sparse vector embeddings are to be maintained for each document. This will add a pipeline to the index that will generate embeddings through the requested model (ELSER version 2 in this case).
The Elasticsearch integration with Langchain provides other strategies that can be used depending on the use case. In particular, the ApproxRetrievalStrategy can be used when dense vector embeddings are used.