Building a Retrieval Augmented Generation (RAG) Chatbot
In this blog, we will explore how to build a Retrieval Augmented Generation (RAG) Chatbot (GenAI Chat) that allows users to interact with their data. By pairing a large language model (LLM) with a vector database, the Chatbot delivers highly contextual and relevant responses to user queries.
What is Retrieval Augmented Generation (RAG)?
Retrieval Augmented Generation (RAG) is an AI technique that enhances the capabilities of large language models by augmenting them with external knowledge. Instead of relying solely on the pre-trained knowledge of an LLM, RAG retrieves relevant information from a database or knowledge store to provide contextually accurate responses. This approach helps:
- Handle domain-specific queries.
- Improve accuracy by grounding responses in factual data.
- Reduce hallucination (a common problem in LLMs where they generate incorrect or irrelevant information).
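To make the idea concrete, here is a minimal sketch of the retrieve-then-generate pattern. The helpers `embed`, `vector_search`, and `llm_complete` are hypothetical placeholders for the concrete components (OpenAI embeddings, Redis vector search, and the chat model) that we build in the rest of this post.

```python
# Minimal RAG sketch; embed, vector_search, and llm_complete are hypothetical
# placeholders for the real components built later in this post.
def rag_answer(question: str) -> str:
    query_vector = embed(question)                  # 1. embed the user question
    top_chunks = vector_search(query_vector, k=3)   # 2. retrieve similar document chunks
    context = "\n\n".join(top_chunks)               # 3. consolidate chunks into one context
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm_complete(prompt)                     # 4. generate a grounded answer
```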
Let’s build a Chatbot using RAG
The chatbot consists of these core components:
- Frontend: It takes user queries and sends them to the backend. It’s built with HTML + JavaScript and runs in a Docker container with Nginx.
- Backend: It takes user queries, fetches relevant documents from Redis Vector DB, builds prompts, and sends them to the LLM for response generation. It’s built with Flask and runs in a Docker container.
- Redis Vector Database: It stores the document text, embedding vectors, and session data. It also runs in a Docker container.
- OpenAI LLM: It takes a prompt and generates a response. We will be using the `gpt-4o` model for generating responses and the `text-embedding-3-small` model for generating embedding vectors (embedding dimension `1536`). These models are hosted in the cloud, and the chatbot makes API calls to communicate with them (a minimal sketch of these API calls follows this list).
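For reference, the calls to these hosted models look roughly like the sketch below, which uses the official `openai` Python client; API-key handling, retries, and error handling are omitted.

```python
# Sketch of the OpenAI API calls; assumes the OPENAI_API_KEY environment variable is set.
from openai import OpenAI

client = OpenAI()

# text-embedding-3-small returns a 1536-dimensional vector per input string
embedding = client.embeddings.create(
    model="text-embedding-3-small",
    input="What topics do the uploaded documents cover?",
).data[0].embedding

# gpt-4o generates a chat response from a list of role/content messages
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(len(embedding), completion.choices[0].message.content)
```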
System Architecture
Data indexing
Users can index documents into the Redis Vector Database from the following three sources:
- Local Files
- Azure Blob Storage
- AWS S3 Buckets
Once documents are uploaded, they go through a chunking process that breaks large documents into multiple smaller documents (chunks). An embedding vector is then generated for each chunk using OpenAI’s text embedding model. Finally, each chunk’s text and its embedding vector are stored in the Redis Vector Database.
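The exact chunking logic lives in the repository; the sketch below shows one plausible shape for this step, assuming a simple fixed-size splitter with overlap and the `openai` client introduced earlier. The chunk size and overlap values are illustrative.

```python
# Sketch of the chunk-and-embed step; chunk size and overlap are illustrative values.
from openai import OpenAI

client = OpenAI()
EMBEDDING_MODEL = "text-embedding-3-small"  # produces 1536-dimensional vectors

def chunk_text(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    """Split a document into overlapping fixed-size chunks."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    """Generate one embedding vector per chunk."""
    result = client.embeddings.create(model=EMBEDDING_MODEL, input=chunks)
    return [item.embedding for item in result.data]
```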
Below is the index schema for the Redis Vector Database index, which stores several pieces of information for each document chunk: the index name, filename, chunk ID, text content, source details, tags, and embedding vector.
# Requires redis-py. redis_client, EMBEDDING_DIMENSIONS (1536), and DOC_PREFIX
# are defined in the backend configuration.
from redis.commands.search.field import TextField, TagField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType

def create_index(index_name):
    try:
        # check to see if the index already exists
        redis_client.ft(index_name).info()
        print("Index already exists!")
    except:
        # schema
        schema = (
            TextField("index_name"),              # Text Field
            TextField("file_name"),               # Text Field
            TextField("chunk"),                   # Text Field
            TextField("content"),                 # Text Field
            TextField("source"),                  # Text Field
            TagField("tag"),                      # Tag Field
            VectorField("vector",                 # Vector Field Name
                "FLAT", {                         # Vector Index Type: FLAT or HNSW
                    "TYPE": "FLOAT32",            # FLOAT32 or FLOAT64
                    "DIM": EMBEDDING_DIMENSIONS,  # Number of Vector Dimensions
                    "DISTANCE_METRIC": "COSINE",  # Vector Search Distance Metric
                }
            ),
        )
        # index definition
        definition = IndexDefinition(prefix=[DOC_PREFIX], index_type=IndexType.HASH)
        # create index
        redis_client.ft(index_name).create_index(fields=schema, definition=definition)
        print(f"Index {index_name} created!")
Interacting with the Chatbot
Once documents are indexed, users can inspect them using RedisInsight and then start interacting with the chatbot via the frontend.
- When a user submits a question, the frontend sends a `POST` request to the backend with the question, objective, index name, session ID, etc.
- Users can update the objective and index name from the `Settings` page.
- The session ID is automatically generated when the page is loaded. It resets either upon page reload or when the `Clear Chat` button is clicked.
- The backend receives the question and queries Redis to retrieve the most relevant documents based on vector similarity (see the search sketch after this list).
- A few of the top-ranked documents are consolidated into a single context.
- This context, along with the user’s question and objective, is encapsulated into a message object (shown below) and sent to OpenAI’s LLM.
- The LLM processes the input and generates a response.
- To ensure transparency and traceability, the backend enriches the response with the filenames of the source documents.
- The final response is returned to the frontend and displayed to the user.
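The retrieval step above boils down to a KNN query against the vector index. Below is a sketch using `redis-py`'s search API; the exact query string, return fields, and value of k used in the repository may differ.

```python
# Sketch of retrieving the k most similar chunks for an embedded question.
import numpy as np
from redis.commands.search.query import Query

def search_chunks(index_name: str, question_vector: list[float], k: int = 3):
    query = (
        Query(f"*=>[KNN {k} @vector $vec AS score]")  # cosine distance, per the schema
        .sort_by("score")
        .return_fields("file_name", "content", "score")
        .dialect(2)
    )
    params = {"vec": np.array(question_vector, dtype=np.float32).tobytes()}
    return redis_client.ft(index_name).search(query, query_params=params).docs
```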
Message history
Large Language Models (LLMs) are inherently stateless, meaning they do not retain the memory of previous interactions. To enable a chatbot with multi-turn conversational capabilities, it is crucial to provide the LLM with past conversations.
At the start of a conversation, the message object is structured as follows. This final message object is then sent to OpenAI to generate a response.
messages = [
    {"role": "system", "content": objective},
    {"role": "system", "content": document_context},
    {"role": "user", "content": question}
]
Each time the LLM generates a response, it is appended to the message object. The updated message object is then stored in Redis using a unique session ID. After several interactions, the message object might look like this:
messages = [
    {"role": "system", "content": objective},
    {"role": "system", "content": document_context},
    {"role": "user", "content": question},
    {"role": "assistant", "content": response},
    {"role": "user", "content": question},
    {"role": "assistant", "content": response},
    {"role": "user", "content": question},
    ...
]
Note:
- Past conversations are removed from the message object based on the token limit specified in `backend/common/config.py`. This keeps the message object within a defined size and ensures that the LLM focuses more on recent conversations.
- The `document_context` is removed from the message object before saving it in Redis. It is dynamically updated for each new question to ensure the latest context is provided to the LLM.
- The message objects are retained in Redis for 15 minutes, after which they are automatically purged.
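A rough sketch of this session handling is shown below, reusing the `redis_client` from the indexing code. The key format, `MAX_HISTORY_TOKENS` value, and word-count-based token estimate are illustrative stand-ins; the real limit lives in `backend/common/config.py`, and the repository may count tokens differently.

```python
# Sketch of persisting trimmed message history in Redis with a 15-minute expiry.
import json

SESSION_TTL_SECONDS = 15 * 60
MAX_HISTORY_TOKENS = 3000  # illustrative; the real limit is in backend/common/config.py

def estimate_tokens(messages: list[dict]) -> int:
    # crude approximation: word count as a stand-in for real token counting
    return sum(len(m["content"].split()) for m in messages)

def save_history(session_id: str, messages: list[dict], document_context: str) -> None:
    # drop the dynamic document context; it is rebuilt for every new question
    history = [m for m in messages if m["content"] != document_context]
    # trim the oldest turns (keeping the objective at index 0) until under the budget
    while estimate_tokens(history) > MAX_HISTORY_TOKENS and len(history) > 1:
        history.pop(1)
    redis_client.setex(f"session:{session_id}", SESSION_TTL_SECONDS, json.dumps(history))
```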
Classifying LLM Responses for Source Exclusion
In certain types of LLM responses, such as greeting messages and thank-you messages, it is unnecessary to display source information in the final output. To address this, we use a DistilBERT model to classify LLM responses into three specific categories: Greeting messages, Thank You messages, and Bad messages. Below are examples of each category:
- Greeting messages. Example: "Hi, how can I assist you today?"
- Thank you messages. Example: "I’m glad that was helpful! Let me know if there’s anything else I can assist you with."
- Bad messages. Example: "I’m sorry, but I need more clarity on your question. Can you provide more details?"
If an LLM response is classified into any of these categories, the source information will be omitted from the final output.
Note: This feature is experimental and may not always work as expected; users can enable or disable it from the Settings page.
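A sketch of how such a classifier can be wired in with Hugging Face `transformers` is shown below; the checkpoint path and label names are placeholders, since the fine-tuned model and its label set ship with the project repository.

```python
# Sketch of classifying LLM responses with a fine-tuned DistilBERT classifier.
# The checkpoint path and label names below are placeholders.
from transformers import pipeline

classifier = pipeline("text-classification", model="path/to/finetuned-distilbert")

SKIP_SOURCE_LABELS = {"greeting", "thank_you", "bad"}

def should_hide_sources(llm_response: str) -> bool:
    """Return True if source filenames should be omitted from the final output."""
    label = classifier(llm_response)[0]["label"]
    return label in SKIP_SOURCE_LABELS
```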
Please feel free to explore the complete code in this GitHub repository.
Conclusion
In this blog, we discussed how to design and build a Retrieval Augmented Generation (RAG) based chatbot that combines a large language model with a vector database to provide accurate, context-aware responses to user queries.
We explored the process of indexing data from multiple sources such as local files, Azure Blob Storage, and AWS S3, as well as the chunking and embedding steps that ensure the data is efficiently stored and accessed. This approach not only improves accuracy but also reduces the risk of hallucinations, making the chatbot a reliable tool for handling domain-specific queries.
Thank you for taking the time to read this article! I hope it offered some valuable insights. If you found it helpful, feel free to clap, share it with others, or support my work with a coffee!
If you have any questions or thoughts, feel free to leave a comment. You can follow me on Medium, LinkedIn, GitHub or contact me at atinesh.s@gmail.com.