Building a RAG System With Gemma, Hugging Face & Elasticsearch

RAG using Gemma and Elastic


Google recently launched Gemma, a family of state-of-the-art open models built on the same research and technology used to create the Gemini models. You can customize (fine-tune) Gemma on your private data, ask it to summarize text or answer questions, or use it for RAG (Retrieval Augmented Generation) by supplying retrieved documents as context. Gemma has been released in two sizes, Gemma 2B and Gemma 7B, each with pre-trained and instruction-tuned variants: gemma-2b, gemma-2b-it, gemma-7b and gemma-7b-it. These are text-to-text, decoder-only models that work only with the English language.

This blog will show you how to build a RAG system using Elasticsearch and Python to perform a semantic search and create a question-answering service that runs on your private data set. You will fetch the most relevant documents as a context window and send them to the Gemma model along with a question to be answered.



We will construct the RAG system step by step, using LangChain to build the complete flow. LangChain provides a framework for developing LLM-powered apps, though you can also write your own RAG flow without it.

  1. Prepare documents to store in Elasticsearch by passing through the ELSERv2 model.
  2. Run the Gemma model locally by using Hugging Face.
  3. Perform semantic search and fetch the most relevant document set.
  4. Ask questions to the locally running Gemma model by passing prompts that include the retrieved context.

We are going to build a complete flow in Python.

RAG architecture using Gemma, Elastic and LangChain

1. Import packages & credentials

Install required packages

pip install -q -U elasticsearch langchain transformers huggingface_hub torch

Import all dependencies

import json
import os
from getpass import getpass
from urllib.request import urlopen

from elasticsearch import Elasticsearch, helpers
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import ElasticsearchStore
from langchain import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from huggingface_hub import login
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

We’re going to use different modules of LangChain for different purposes.

Get credentials

ELASTIC_API_KEY = getpass("Elastic API Key :")
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID :")
elastic_index_name = "gemma-rag"

This will accept ELASTIC_API_KEY and ELASTIC_CLOUD_ID from the user input. All data will be stored in the gemma-rag index.

2. Preparing Documents

Let's download the sample dataset and deserialize the document.

url = ""
response = urlopen(url)
workplace_docs = json.loads(response.read())

The JSON contains workplace data such as the vacation policy, the work-from-home policy, an explanation of how compensation works, onboarding steps, etc. Treat this as our private data set: Gemma was not trained on it and has no access to it. In the end, our system will answer questions from this JSON data only.
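As a rough illustration of the shape of this data, the sketch below deserializes a tiny inline sample instead of the real file. The field names name, summary, content and rolePermissions mirror the ones used later in this post; the values are made up for the example.

```python
import json

# Hypothetical sample record: only the field names come from the dataset
# used in this post, the values are invented for illustration.
sample_json = """
[
  {
    "name": "Work From Home Policy",
    "summary": "Guidelines for full-time remote work.",
    "content": "Employees may work remotely.\\n\\nEquipment is provided by the company.",
    "rolePermissions": ["demo"]
  }
]
"""

# json.loads turns the JSON array into a list of Python dictionaries.
sample_docs = json.loads(sample_json)

for doc in sample_docs:
    print(doc["name"], "->", len(doc["content"]), "characters of content")
```

Each record becomes a plain dictionary, so the later steps can read doc["content"] for chunking and keep the remaining fields as metadata.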

Document chunking

Document chunking in RAG refers to breaking down large documents into smaller segments for more efficient processing and retrieval during question-answering tasks.

Why is document chunking needed?

Context window limit

In RAG (Retrieval Augmented Generation), the context window refers to the retrieved text that is passed to the model along with the question. It provides the context the model needs to generate accurate and meaningful answers. The size of the context window can vary depending on the specific implementation or the limits of the LLM.

LLMs have limits on the context window. As mentioned in the technical report, Gemma models have a context length of 8192 tokens, so the prompt, including the retrieved context, must fit within 8192 tokens. Chunking helps you stay within that limit.

Note: Tokens are the basic units of data processed by LLMs. In the context of text, a token can be a word, part of a word (subword), or even a character — depending on the tokenization process.
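To make the limit concrete, here is a minimal sketch of keeping retrieved chunks within a token budget. It assumes a crude whitespace split as a stand-in for a real tokenizer (Gemma's tokenizer will count differently), and the helper names and sample chunks are hypothetical.

```python
from typing import List


def approx_token_count(text: str) -> int:
    """Rough estimate: whitespace-separated words (an assumption, not Gemma's tokenizer)."""
    return len(text.split())


def fit_to_budget(chunks: List[str], max_tokens: int) -> List[str]:
    """Keep chunks in order until adding the next one would exceed max_tokens."""
    selected = []
    used = 0
    for chunk in chunks:
        cost = approx_token_count(chunk)
        if used + cost > max_tokens:
            break
        selected.append(chunk)
        used += cost
    return selected


chunks = [
    "Pets are allowed in the office on Fridays.",
    "Remote employees receive a laptop and a monitor.",
    "Vacation requests need manager approval.",
]
# A deliberately tiny budget so that the last chunk no longer fits.
context = fit_to_budget(chunks, max_tokens=16)
print(context)
```

With a real model you would replace approx_token_count with the model's tokenizer and leave room in the budget for the question and the generated answer.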

Balancing Context Size

When utilizing Large Language Models (LLMs), it's crucial to consider the size of the context we feed into the model.

LLMs have a limit on the number of tokens they can handle at once. For instance, GPT-3.5-turbo has a token limit of 4096.

Moreover, as the context size increases, the quality of generated responses may decrease, leading to potential inaccuracies or hallucinations.

Handling larger contexts also translates to longer processing times and higher costs associated with LLM usage.

This underscores the importance of mastering the art of retrieval. Achieving the right balance between context chunking and embedding accuracy is key.


For example, suppose one document in your JSON data covers three policies: leave, work-from-home and pet policy. Whether you search for “pet policy”, “work from home” or “leave policy”, the search returns the same document. That context then carries extra, irrelevant information, which can lead the LLM to hallucinate or give a less accurate answer. If you instead split this document into three documents (one per policy), the search picks only the relevant chunk to use as context.

How to chunk data?

There are different strategies for chunking the data:

  1. Fixed-length chunking: Splitting a document into fixed-size pieces, measured in characters, words, etc.
  2. Context-aware chunking: Splitting documents logically and semantically.
  3. NLP-driven chunking: Using NLP (Natural Language Processing) to chunk the data more effectively. Breaking large bodies of text into manageable chunks allows us to summarize each section separately, resulting in a more precise overall summary.

You can come up with your own logic to chunk the data according to your requirements. In this example, we’re going to use fixed-length chunking with LangChain’s CharacterTextSplitter(). It splits the text on a separator (by default a blank line, "\n\n") and merges the pieces into chunks of up to chunk_size characters; a single piece longer than chunk_size is kept whole.

metadata = []
content = []

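for doc in workplace_docs:
    # The text to be chunked
    content.append(doc["content"])
    # Metadata stored alongside every chunk of this document
    metadata.append(
        {
            "name": doc["name"],
            "summary": doc["summary"],
            "rolePermissions": doc["rolePermissions"],
        }
    )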

text_splitter = CharacterTextSplitter(chunk_size=50, chunk_overlap=0)
docs = text_splitter.create_documents(content, metadatas=metadata)

Here we are going to create the chunks for the content field.
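To illustrate what fixed-length chunking means independent of any library, here is a minimal sketch that slices text into character windows. It is not how CharacterTextSplitter is implemented (that class splits on a separator first and then merges the pieces); the function name and the sample text are made up for illustration.

```python
from typing import List


def fixed_length_chunks(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Slice text into windows of at most chunk_size characters, where each
    window shares chunk_overlap characters with the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


policy = "Pets are welcome on Fridays. Remote work needs approval."
for chunk in fixed_length_chunks(policy, chunk_size=20, chunk_overlap=5):
    print(repr(chunk))
```

Pure character windows can cut a sentence in half, which is one reason separator-aware splitters are usually preferred for prose.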

3. Index Documents

This step assumes you have already downloaded and deployed the ELSERv2 model. ELSER (Elastic Learned Sparse EncodeR) is a retrieval model developed by Elastic. It enables semantic search and improves search relevance by considering contextual meaning and user intent instead of relying solely on exact keyword matches.

We'll use ElasticsearchStore, part of LangChain's vector store functionality, to index the documents.

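es = ElasticsearchStore.from_documents(
    docs,
    es_cloud_id=ELASTIC_CLOUD_ID,
    es_api_key=ELASTIC_API_KEY,
    index_name=elastic_index_name,
    strategy=ElasticsearchStore.SparseVectorRetrievalStrategy(model_id=".elser_model_2"),
)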


Let’s verify that the documents were inserted properly. Log in to Kibana, go to the menu ☰ > Management > Dev Tools, and run the query below on the gemma-rag index.

GET gemma-rag/_search

      {
        "_index": "gemma-rag",
        "_id": "f0cb9857-6500-41de-89e6-c29ebede31ab",
        "_score": 1,
        "_ignored": [...],
        "_source": {
          "metadata": {
            "summary": "This policy outlines the guidelines for full-time remote work, including eligibility, equipment and resources, workspace requirements, communication expectations, performance expectations, time tracking and overtime, confidentiality and data security, health and well-being, and policy reviews and updates. Employees are encouraged to direct any questions or concerns",
            "rolePermissions": [...],
            "name": "Work From Home Policy"
          },
          "vector": {
            "tokens": {
              "19": 1.1510628,
              "2019": 0.83059055,
              "laptop": 0.2694121,
              "rent": 0.17121923,
              "conducting": 0.118694015,
              "freelance": 0.6926271,
              "broad": 0.2849962,
              "guidelines": 1.0599052,
              "supporting": 0.16413163,
              "ensuring": 0.48137796,
              "mask": 0.074894086,
              "delivery": 0.18148012,
              "hours": 0.05213894,
              "comply": 0.20511511,
              "continuity": 0.87717825,
              "mobile": 0.6216534,
              "time": 0.85393053,
              "threat": 0.066342406,
              "pm": 0.19746083
            },
            "model_id": ".elser_model_2"
          },
          "text": """The purpose of this full-time work-from-home policy is to provide guidelines and support for employees to conduct their work remotely, ensuring the continuity and productivity of business operations during the COVID-19 pandemic and beyond."""
        }
      }
  • text: This field holds the chunked data.
  • vector.tokens: Contains all tokens generated by the ELSER model, each with a weight. The semantic search is performed on this field.
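As intuition for how a search over these token weights can rank documents, the sketch below scores documents by the dot product of query and document token weights. This is a simplified illustration of sparse-vector scoring, not Elasticsearch's or ELSER's actual implementation; the query weights and the pet policy document are invented for the example.

```python
from typing import Dict


def sparse_score(query_tokens: Dict[str, float], doc_tokens: Dict[str, float]) -> float:
    """Dot product over the tokens that the query and the document share."""
    return sum(
        weight * doc_tokens[token]
        for token, weight in query_tokens.items()
        if token in doc_tokens
    )


# The work-from-home weights are copied from the example hit above;
# the pet policy weights are made up.
docs = {
    "Work From Home Policy": {"laptop": 0.2694121, "guidelines": 1.0599052, "mobile": 0.6216534},
    "Pet Policy": {"pets": 1.2, "guidelines": 0.4, "office": 0.9},
}

# Hypothetical token expansion of a query such as "remote work guidelines".
query = {"guidelines": 1.0, "laptop": 0.5, "remote": 0.8}

# Rank document names from highest to lowest score.
ranked = sorted(docs, key=lambda name: sparse_score(query, docs[name]), reverse=True)
print(ranked)
```

Tokens that appear on only one side contribute nothing, so a document ranks highly only when it shares heavily weighted tokens with the query.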

4. Load Gemma model locally using Hugging Face

Why Hugging Face?

Hugging Face is a collaboration platform for hosting and sharing open models, datasets and demo apps. Many of them are publicly available, and you can download models from it and run them locally on your machine.

Gemma is a state-of-the-art open model hosted on Hugging Face, so you can run it on your local machine.

Hugging Face login

To get started with Gemma, you need to pass the Hugging Face access token while executing notebook_login().

from huggingface_hub import notebook_login

notebook_login()


Huggingface login on local

Enter the Hugging Face access token and click on the Login button.

Initialize the model and the tokenizer (google/gemma-2b-it)

model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")

AutoTokenizer is used to convert user input into a stream of tokens which can be processed by the Gemma models.


  • GPU usage: To run the model on GPU, pass the device_map="auto" parameter to the model's from_pretrained method (this requires the accelerate package).
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it", device_map="auto")
  • CPU usage: Omit the device_map parameter and the model will run on the CPU.

You can explore more usage and optimization methods and use them according to your requirements.

Create a text-generation pipeline and initialize with LLM

Here we’re going to use the pipeline function from the transformers library. It is a high-level abstraction over the task-specific pipelines and provides an easy way to use models for inference.

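pipe = pipeline(
    "text-generation", model=model, tokenizer=tokenizer,
    max_new_tokens=1024,  # assumed value: the original listing is cut off here
)

llm = HuggingFacePipeline(pipeline=pipe, model_kwargs={"temperature": 0.7})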
  • text-generation: Returns a TextGenerationPipeline, which predicts the words that will follow a specified text prompt.
  • model & tokenizer: Pass the model and tokenizer that we initialized in the previous step.
  • max_new_tokens: The maximum number of tokens to generate, ignoring the number of tokens in the prompt.
  • device="cuda": Optionally pass this parameter to perform all computations on a CUDA GPU.

5. Create a chain using a prompt template

Now we are going to perform a semantic search using retrievers. It is going to use the ELSERv2 model to perform the search.

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

retriever = es.as_retriever(search_kwargs={"k": 5})

Here "k": 5 sets the maximum number of documents to return. The returned documents are passed to the format_docs function, which concatenates them into a single context string.

We’re going to use a static template in which context and question are placeholders. Both are replaced on the fly based on the question being asked. You can use your own prompt or template.
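To see what format_docs produces, the sketch below runs it on stand-in objects. FakeDoc is a hypothetical substitute for LangChain's document class (only the page_content attribute is mimicked), and the slice [:k] merely imitates a retriever that returns at most k results.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a retrieved document; only page_content is modeled."""
    page_content: str


def format_docs(docs):
    # Join the chunk texts with a blank line between them.
    return "\n\n".join(doc.page_content for doc in docs)


# Pretend the index matched 7 chunks for a question.
hits = [FakeDoc(f"chunk {i}") for i in range(1, 8)]

k = 5
context = format_docs(hits[:k])  # at most k chunks end up in the context
print(context)
```

Raising k gives the model more material to draw on but also consumes more of its context window.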

template = """Answer the question based only on the following context:\n

{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
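Conceptually, the chain performs four steps in sequence: retrieve, format, fill the prompt, generate. The sketch below mirrors that flow in plain Python with stub components, so it runs without Elasticsearch or a model. fake_retriever, fake_llm, run_chain and the sample chunks are hypothetical placeholders, not LangChain APIs.

```python
TEMPLATE = """Answer the question based only on the following context:

{context}

Question: {question}
"""

# Hypothetical in-memory "index" standing in for Elasticsearch.
CHUNKS = [
    "Pets are allowed in the office on Fridays.",
    "Remote employees receive a laptop.",
]


def fake_retriever(question, k=5):
    """Return up to k chunks sharing at least one word with the question (toy matching)."""
    words = set(question.lower().replace("?", "").split())
    matches = [c for c in CHUNKS if words & set(c.lower().replace(".", "").split())]
    return matches[:k]


def fake_llm(prompt):
    """Stub model: reports the prompt size instead of generating text."""
    return f"(model received {len(prompt)} characters)"


def run_chain(question):
    # Step 1 and 2: retrieve the chunks and join them into one context string.
    context = "\n\n".join(fake_retriever(question))
    # Step 3: fill the placeholders of the template.
    prompt = TEMPLATE.format(context=context, question=question)
    # Step 4: send the finished prompt to the (stub) model.
    return prompt, fake_llm(prompt)


prompt, answer = run_chain("What is the pet policy in the office?")
print(prompt)
print(answer)
```

Printing the filled prompt is a handy debugging step with the real chain too, since it shows exactly which chunks the model was given.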

6. Ask a question

chain.invoke("What is the pet policy in the office?")

RAG answer


In this blog, we've explored how to integrate Gemma into a RAG system that uses Elasticsearch for semantic search and document retrieval. Gemma models can also be fine-tuned further. Because of their relatively small size, they can be deployed in many environments, such as laptops, desktops and private servers. By following the outlined steps and using the LangChain framework with Python, developers can integrate Gemma into their projects for generation tasks. Alternatively, you can write the entire RAG flow without LangChain, or in another language. The complete Python notebook showcasing all the above implementations can be found in the elasticsearch-labs repository.
