LLMs are in the spotlight now. They are a vast source of knowledge and have changed the face of how search engines work. Information retrieval and search have become a lot easier since the debut of chatbots like ChatGPT. While the knowledge of current AI models, including ChatGPT, is confined to information available up to 2021, Bing has adopted a different approach: it augments its responses with up-to-date information extracted from the internet, offering a more current and comprehensive knowledge base.
Retrieval Augmented Generation (RAG)
But there is another approach that augments the knowledge of LLMs by retrieving information from custom content: Retrieval Augmented Generation (RAG). Utility tools like ChatPDF have become popular examples of this pattern: a PDF document is connected as an external data source, and we interact with it with the assistance of an LLM. In RAG, we insert additional data into the context (prompt) of a model at inference time, which helps the LLM produce more precise and relevant responses to our queries than zero-shot prompting. Another way of looking at it is the relationship between a doctor and a patient: a doctor’s diagnosis can be significantly more precise and accurate when they have access to the patient’s test results and charts, as opposed to relying solely on symptomatic observations.
Here’s a quick step-by-step guide to building a RAG based LLM application.
The System Workflow
The workflow of the RAG based LLM application will be as follows:
- Receive query from the user.
- Convert it to an embedded query vector preserving the semantics, using an embedding model.
- Retrieve the top-k most relevant chunks from the vector database by computing the similarity between the query embedding and the content embeddings stored in the database (see the sketch after this list).
- Pass the retrieved content and query as a prompt to an LLM.
- The LLM gives the required response.
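Conceptually, steps 2 and 3 boil down to embedding the query and ranking stored chunks by similarity. Here is a minimal, self-contained sketch that uses sentence-transformers directly, with toy chunks and a hypothetical query; the rest of this tutorial does the same thing through a vector database and LangChain.

```python
# Minimal sketch of steps 2-3: embed the query and rank stored chunks by cosine similarity.
# The chunks and the query below are toy examples; a real app stores the chunk embeddings
# in a vector database instead of an in-memory list.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Hemoglobin: 13.5 g/dL (reference range 13.0-17.0)",
    "The patient reports mild fatigue and occasional headaches.",
    "Fasting blood glucose: 92 mg/dL",
]

chunk_vecs = model.encode(chunks, normalize_embeddings=True)                  # content embeddings
query_vec = model.encode("What is the hemoglobin level?", normalize_embeddings=True)

scores = np.dot(chunk_vecs, query_vec)      # cosine similarity (vectors are normalized)
top_k = np.argsort(scores)[::-1][:2]        # indices of the top-k most similar chunks
retrieved = [chunks[i] for i in top_k]      # this context is passed to the LLM along with the query
```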
Prerequisites
The directory structure for the project is shown below.
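The exact layout can vary; an indicative structure, based on the files referenced throughout this tutorial, looks like this:

```
project/
├── docs/                 # source PDF(s) to index
├── db/                   # Chroma persistence directory (created by ingest.py)
├── LaMini-T5-738M/       # model checkpoint cloned from Hugging Face
├── constants.py
├── ingest.py
├── app.py
└── requirements.txt
```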
Ensure that you are using Python version 3.9.0 or later. Install the following Python libraries by preparing a requirements.txt file.
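A minimal requirements.txt for this stack might look like the following (versions unpinned; adjust as needed):

```
langchain
chromadb
sentence-transformers
transformers
torch
accelerate
pdfminer.six
```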
We’ve used a large language model that runs reasonably well on a CPU with a minimum of 8 GB of RAM, though better specifications are recommended. If you’re considering other large language models, a cloud-based environment like E2E Cloud might be necessary.
Clone the model repository from Hugging Face to the working directory.
Make sure you have git installed on your system.
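For example, assuming the MBZUAI/LaMini-T5-738M repository on Hugging Face (git-lfs is needed to pull the model weights):

```
git lfs install
git clone https://huggingface.co/MBZUAI/LaMini-T5-738M
```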
Configuring the Database
constants.py
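A sketch of a minimal constants.py, assuming an older Chroma release that uses the duckdb+parquet backend:

```python
# constants.py — a minimal sketch; assumes an older Chroma release with the duckdb+parquet backend.
from chromadb.config import Settings

# Directory where the Chroma index is persisted
PERSIST_DIRECTORY = "db"

CHROMA_SETTINGS = Settings(
    chroma_db_impl="duckdb+parquet",      # storage backend (parquet files on disk)
    persist_directory=PERSIST_DIRECTORY,  # where the index is written
    anonymized_telemetry=False,           # disable usage telemetry
)
```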
Loading the Data
The retrieval knowledge base must be constructed before building the application. For this we use a vector database. In order to retrieve specific information from a document, such as a patient’s lab report, we first need to process the content of the document. This involves converting the raw data into a format that can be understood and manipulated by our system.
Once the data is processed, it is then stored in a database. However, instead of storing the data in its original form, we convert it into a mathematical representation known as an embedding. These embeddings capture the semantic meaning of the data and allow us to perform complex operations on it. For example, if we want to query information from a patient’s lab report, we search for the embedding of its contents in our vector database.
We are using Chroma DB here for simplicity. Chroma DB is an open-source, simple yet feature-rich vector database for building AI applications. Check out the documentation for details.
Import libraries and load contents into the vector database.
ingest.py
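A sketch of the imports, assuming the classic single-package langchain layout (before the langchain-community split):

```python
import os

from langchain.document_loaders import PDFMinerLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.vectorstores import Chroma

from constants import CHROMA_SETTINGS  # settings defined in constants.py above
```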
Walk through the docs directory and create a PDFMinerLoader object for each PDF file. After the loop finishes, the data from the last processed PDF file is loaded into the documents variable (the app in this tutorial works with a single PDF).
ingest.py
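A sketch of the loading step, assuming the PDF files live in a docs directory:

```python
# Walk the docs directory and create a loader for each PDF found.
# Note: documents ends up holding only the last PDF processed, which is fine
# for the single-PDF setup used in this tutorial.
for root, dirs, files in os.walk("docs"):
    for file in files:
        if file.endswith(".pdf"):
            loader = PDFMinerLoader(os.path.join(root, file))

documents = loader.load()  # list of Document objects containing the PDF's text
```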
The document is segmented into multiple parts to simplify the search process and aid efficient retrieval of the most relevant content. We use RecursiveCharacterTextSplitter from LangChain to split the document into chunks of 500 characters with an overlap of 500 characters between consecutive chunks.
Then, the SentenceTransformerEmbeddings wrapper with the "all-MiniLM-L6-v2" model is used to generate embeddings (numerical representations) for each chunk of text.
ingest.py
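A sketch of the splitting and embedding step (the overlap value follows the text above, though an overlap smaller than the chunk size, e.g. 50, is more typical):

```python
# Split the loaded document into overlapping character chunks.
# Note: an overlap smaller than the chunk size (e.g. 50) is more common;
# the values below follow the description in the text.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=500)
texts = text_splitter.split_documents(documents)

# Embedding model used to turn each chunk into a vector
embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
```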
Create a Chroma object from the text chunks and the embedding model. The data is then persisted (saved) as parquet files in a directory named "db" for future use.
ingest.py
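A sketch of the persistence step, reusing CHROMA_SETTINGS from constants.py:

```python
# Build the vector store from the chunks and persist it to the db directory
db = Chroma.from_documents(
    texts,
    embeddings,
    persist_directory="db",
    client_settings=CHROMA_SETTINGS,
)
db.persist()
db = None  # release the handle so the data is flushed to disk
```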
Creating LLM Object
Here we are using an open-source, lightweight LLM called LaMini-T5-738M. Load the tokenizer and the model from the pretrained checkpoint, using the AutoModelForSeq2SeqLM class for the seq2seq (encoder-decoder) model, which has a language modeling (LM) head on top.
app.py
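A sketch of the model loading step, assuming the checkpoint was cloned into the working directory (device_map="auto" requires the accelerate package):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Path to the checkpoint cloned earlier (or the Hugging Face model id)
checkpoint = "LaMini-T5-738M"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
base_model = AutoModelForSeq2SeqLM.from_pretrained(
    checkpoint,
    device_map="auto",          # place the model automatically (CPU in this setup)
    torch_dtype=torch.float32,  # full precision; the model is small enough for CPU
)
```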
Create a pipeline for text-to-text generation using the specified model, tokenizer, and several parameters that control the text generation process. Adjust the temperature parameter to control the randomness of the output. Lower values make the output more deterministic. max_length sets the maximum length of the generated text.
app.py
```python
from langchain.llms import HuggingFacePipeline
```
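Building on the import above and the model and tokenizer loaded earlier, a sketch of the pipeline setup (the generation parameter values are indicative):

```python
from transformers import pipeline

# Text-to-text generation pipeline built on the LaMini-T5 model and tokenizer
pipe = pipeline(
    "text2text-generation",
    model=base_model,
    tokenizer=tokenizer,
    max_length=256,      # maximum length of the generated answer
    do_sample=True,
    temperature=0.3,     # lower temperature -> more deterministic output
    top_p=0.95,
)

# Wrap the pipeline so LangChain can use it as an LLM
llm = HuggingFacePipeline(pipeline=pipe)
```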
Configuring the Chain
Set up a question-answering pipeline using the language model and a retriever. db.as_retriever() creates a retriever from the Chroma database; the retriever is responsible for fetching the documents relevant to a query. In LangChain, a chain is a wrapper that composes multiple components, where each step can be a call to the Large Language Model (LLM) or to another data source such as the retriever. The chain_type argument specifies how the retrieved documents are combined and passed to the LLM; for example, the "stuff" type simply inserts all retrieved chunks into the prompt alongside the question.
app.py
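A sketch of the chain setup, assuming the "stuff" chain type and the same embedding model used at ingest time:

```python
from langchain.chains import RetrievalQA
from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.vectorstores import Chroma

from constants import CHROMA_SETTINGS

# Re-open the persisted Chroma index with the same embedding model used during ingestion
embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
db = Chroma(
    persist_directory="db",
    embedding_function=embeddings,
    client_settings=CHROMA_SETTINGS,
)

retriever = db.as_retriever()  # fetches the most relevant chunks for a query

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",            # stuff all retrieved chunks into the prompt
    retriever=retriever,
    return_source_documents=True,  # also return the chunks the answer was based on
)
```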
Now pass the query and generate a response from the LLM.
app.py
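A sketch of the final step (the query shown is hypothetical):

```python
query = "What is the patient's hemoglobin level?"  # hypothetical example query

result = qa(query)                     # runs retrieval + generation
answer = result["result"]              # the LLM's answer
sources = result["source_documents"]   # chunks the answer was grounded on

print(answer)
```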
Add the PDF file you want to explore to the docs directory, run ingest.py to build the database of embeddings, and then run app.py to see the result.
Sample query:
Wrapping Up
And that was a simple RAG-based LLM application. You have learnt how to use an LLM with RAG to generate relevant and informative answers from large-scale text corpora. Try experimenting with the various chains in LangChain and building multi-PDF readers. We hope you enjoyed this tutorial and found it useful for your projects.
References
https://github.com/AIAnytime/Search-Your-PDF-App