Ingest

In this application, the ingest of all the example documents is triggered with the flask create-index command. This command is implemented in the app.py file in the api directory. It imports the index_data.py module from the data directory and calls its main() function, which performs a complete import of all the documents stored in the data.json file.

Document Structure

The structure of each document is as follows:

  • name: the document title
  • url: a URL to the document hosted on an external site
  • summary: a short summary of the contents of the document
  • content: the body of the document
  • created_on: creation date
  • updated_at: update date (could be missing if the document was never updated)
  • category: the document's category, which can be github, sharepoint or teams
  • rolePermissions: a list of role permissions

Of these fields, this example application uses content as the text to be indexed, and adds name, summary, url, category and updated_at as associated metadata.
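To make the field mapping concrete, the following sketch applies it to a single document. Every field value here is hypothetical and invented for illustration; none of it is taken from data.json.

```python
metadata_keys = ['name', 'summary', 'url', 'category', 'updated_at']

# A made-up document with the fields described above.
doc = {
    'name': 'Example Policy',
    'url': 'https://example.com/docs/example-policy',
    'summary': 'A short summary of the policy.',
    'content': 'The full body of the policy document...',
    'created_on': '2024-01-15',
    'category': 'sharepoint',
    'rolePermissions': ['demo'],
    # 'updated_at' is intentionally absent: this document was never updated
}

# The text that would be indexed
page_content = doc['content']

# The metadata stored alongside it; dict.get() yields None for missing keys
metadata = {k: doc.get(k) for k in metadata_keys}
print(metadata)
```

Because dict.get() is used instead of indexing, a document without an updated_at field gets None for that key rather than raising a KeyError. Fields that are not listed in metadata_keys, such as created_on and rolePermissions, are not carried over.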

The following snippet of Python code shows how the documents are imported:

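import json
from langchain.docstore.document import Document

# FILE holds the path to the data.json file
metadata_keys = ['name', 'summary', 'url', 'category', 'updated_at']
workplace_docs = []
with open(FILE, 'rt') as f:
    for doc in json.loads(f.read()):
        # index the content field; keep the selected fields as metadata
        workplace_docs.append(Document(
            page_content=doc['content'],
            metadata={k: doc.get(k) for k in metadata_keys}
        ))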

Here the json module from the Python standard library is used to read the data file, and a Document object from Langchain is created for each document it contains. Documents have a page_content attribute that defines the content to be converted into vectors and searched, plus a number of additional fields that are stored as metadata. The metadata_keys list determines which fields from the source content are stored as document metadata.

Depending on your ingest needs, this method can be refined or changed. The Langchain project provides a large selection of document loaders that can be used depending on the format of the source content.

The Elastic Learned Sparse EncodeR (ELSER) Model

The Elasticsearch index used in this application is configured to automatically create sparse vector embeddings for all documents that are inserted. The install_elser() function in index_data.py makes sure that the ELSER model is installed and deployed on the Elasticsearch instance that you are using.
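A sparse vector embedding can be pictured as a mapping from tokens to weights in which most tokens of the vocabulary are absent. The toy sketch below shows how two such mappings can be compared with a dot product over the tokens they share. The tokens and weights are made up rather than real ELSER output, and the sparse_dot() helper is hypothetical; this illustrates the idea only and is not how Elasticsearch implements scoring.

```python
def sparse_dot(query_weights, doc_weights):
    """Sum the products of the weights of tokens present in both vectors."""
    return sum(
        weight * doc_weights[token]
        for token, weight in query_weights.items()
        if token in doc_weights
    )

# Made-up token weights, for illustration only
query = {'vacation': 1.5, 'holiday': 1.0, 'policy': 0.5}
doc_a = {'vacation': 2.0, 'leave': 1.0, 'policy': 1.0}
doc_b = {'salary': 2.0, 'policy': 1.0}

score_a = sparse_dot(query, doc_a)  # shares 'vacation' and 'policy'
score_b = sparse_dot(query, doc_b)  # shares only 'policy'
print(score_a, score_b)
```

In this toy example doc_a scores higher than doc_b because it shares more heavily weighted tokens with the query.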

Text Splitting

The content field in these documents is long, which means that a single embedding will be unable to fully represent it. The standard solution when working with large amounts of text is to split the text into shorter passages, and then obtain embeddings for the individual passages, all of which are stored and indexed.

In this application, the RecursiveCharacterTextSplitter class from the Langchain library is used, paired with OpenAI's tiktoken encoder, which measures the length of each passage in tokens, the same units that LLMs use.

Consider the following example, which demonstrates how text splitting works in the application:

>>> from langchain.docstore.document import Document
>>> from langchain.text_splitter import RecursiveCharacterTextSplitter
>>> doc = Document(page_content='the quick brown fox jumped over the lazy dog')
>>> text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=5, chunk_overlap=2)
>>> split_docs = text_splitter.transform_documents([doc])
>>> split_docs
[Document(page_content='the quick brown fox jumped'),
 Document(page_content='fox jumped over the lazy'),
 Document(page_content='the lazy dog')]

The chunk_size argument of the text splitter controls the maximum length of the resulting passages. The chunk_overlap argument allows for some amount of overlap between consecutive passages, which often helps obtain better embeddings.
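The interaction between the two arguments can be sketched with a simple sliding window. This is not the algorithm used by RecursiveCharacterTextSplitter, which prefers to break on separators and counts tiktoken tokens. The hypothetical split_with_overlap() helper below treats each whitespace-separated word as one unit, which is an assumption made to keep the sketch free of dependencies.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into passages of up to chunk_size words, where each
    passage repeats the last chunk_overlap words of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError('chunk_overlap must be >= 0 and smaller than chunk_size')
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    passages = []
    for start in range(0, len(words), step):
        passages.append(' '.join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return passages

passages = split_with_overlap(
    'the quick brown fox jumped over the lazy dog',
    chunk_size=5, chunk_overlap=2,
)
print(passages)
```

Each window advances by chunk_size - chunk_overlap words, so a larger overlap produces more passages for the same text.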

In the actual application, the splitter is initialized with the following arguments:

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=512, chunk_overlap=256
)

You are welcome to change these values and see how changes affect the quality of the chatbot. Each time you change the splitter's configuration you should re-generate the index by running the flask create-index command.
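Before re-generating the index, it can help to get a rough feel for how these two values affect the number of passages stored. The sketch below estimates that number under a simplified sliding-window model. The estimate_passages() helper and the 2,000-token document length are hypothetical, and the real splitter breaks on separators, so actual counts will differ.

```python
import math

def estimate_passages(total_tokens, chunk_size, chunk_overlap):
    """Estimate how many passages a sliding window of chunk_size tokens,
    overlapping by chunk_overlap tokens, needs to cover total_tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError('chunk_overlap must be >= 0 and smaller than chunk_size')
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap  # new tokens covered by each extra passage
    return math.ceil((total_tokens - chunk_overlap) / step)

# Hypothetical 2,000-token document
with_overlap = estimate_passages(2000, chunk_size=512, chunk_overlap=256)
without_overlap = estimate_passages(2000, chunk_size=512, chunk_overlap=0)
print(with_overlap, without_overlap)
```

Under this model, an overlap of half the chunk size roughly doubles the number of passages, and therefore the number of embeddings that are generated and stored.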

Note that there are additional considerations to be aware of when combining text splitting with the ELSER model. For production use cases, you might have to choose a different tokenization method than the one used in this tutorial.

Document Store

Documents are stored in an Elasticsearch index. The name of the index is controlled by the ES_INDEX environment variable, which is defined in the .env file. By default the name of this index is workplace-app-docs.

The application uses the ElasticsearchStore class, which is part of the Elasticsearch integration in Langchain, and uses the official Elasticsearch client library for Python.

The complete logic that deals with the Elasticsearch index is shown below:

import os

from elasticsearch import Elasticsearch, NotFoundError
from langchain_elasticsearch import ElasticsearchStore

INDEX = os.getenv("ES_INDEX", "workplace-app-docs")
ELASTIC_CLOUD_ID = os.getenv("ELASTIC_CLOUD_ID")
ELASTICSEARCH_URL = os.getenv("ELASTICSEARCH_URL")
ELASTIC_API_KEY = os.getenv("ELASTIC_API_KEY")
ELSER_MODEL = os.getenv("ELSER_MODEL", ".elser_model_2")

# create an Elasticsearch client instance
if ELASTICSEARCH_URL:
    elasticsearch_client = Elasticsearch(
        hosts=[ELASTICSEARCH_URL],
    )
elif ELASTIC_CLOUD_ID:
    elasticsearch_client = Elasticsearch(
        cloud_id=ELASTIC_CLOUD_ID, api_key=ELASTIC_API_KEY
    )
else:
    raise ValueError(
        "Please provide either ELASTICSEARCH_URL or ELASTIC_CLOUD_ID and ELASTIC_API_KEY"
    )

# delete the existing index, if found
elasticsearch_client.indices.delete(index=INDEX, ignore_unavailable=True)

# write the documents stored in workplace_docs to the index
ElasticsearchStore.from_documents(
    workplace_docs,
    es_connection=elasticsearch_client,
    index_name=INDEX,
    strategy=ElasticsearchStore.SparseVectorRetrievalStrategy(model_id=ELSER_MODEL),
)

The ElasticsearchStore.from_documents() method imports all the Document instances stored in workplace_docs, writing them to the index given in the index_name argument. All operations are performed through the client given in the es_connection argument.
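The choice of connection settings in the listing above can also be isolated in a small function, which makes the precedence between the environment variables explicit and lets it be checked without a running Elasticsearch instance. This is a sketch only: the connection_kwargs() helper is hypothetical and not part of the application.

```python
def connection_kwargs(url=None, cloud_id=None, api_key=None):
    """Return the keyword arguments for the Elasticsearch client,
    following the same order of precedence as the listing above."""
    if url:
        # a URL takes precedence over a cloud ID when both are set
        return {'hosts': [url]}
    if cloud_id:
        return {'cloud_id': cloud_id, 'api_key': api_key}
    raise ValueError(
        'Please provide either ELASTICSEARCH_URL or '
        'ELASTIC_CLOUD_ID and ELASTIC_API_KEY'
    )

# Placeholder values, for illustration only
local = connection_kwargs(url='http://localhost:9200')
cloud = connection_kwargs(cloud_id='my-cloud-id', api_key='my-api-key')
print(local)
print(cloud)
```

The resulting dictionary could then be passed to the client as Elasticsearch(**connection_kwargs(...)).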

The strategy argument defines the way this index is going to be used. For this application, the SparseVectorRetrievalStrategy class indicates that sparse vector embeddings are to be maintained for each document. This will add a pipeline to the index that will generate embeddings through the requested model (ELSER version 2 in this case).

The Elasticsearch integration with Langchain provides other strategies that can be used depending on the use case. In particular, the ApproxRetrievalStrategy class can be used when working with dense vector embeddings.
