Open Source AI and Document Q&A
Document Q&A is an important class of problems in NLP. Documents are one of the most common ways of storing information, so the collective value of the data they hold is immense. Recent advances in open-source AI and LLMs show what these models are capable of when given the right information in sufficient quantity. Harnessing the information locked inside documents can multiply the benefits of open-source AI.
The Promise of E2E Cloud
To build this multi-document Q&A chatbot, we will leverage E2E Cloud, which offers instant access to advanced cloud GPUs like the A100, and the newly launched H100, at highly cost-effective prices.
To explore, sign up at E2E Cloud, and then head to the ‘Compute’ tab. There, you will find a range of advanced GPUs on offer.
The Llama 2 LLM Model
Llama 2 is an open-source model family developed by Meta that succeeds its Llama 1 LLM. The family consists of models with 7 billion, 13 billion, and 70 billion parameters. The Llama 2 models are trained on 40% more tokens than Llama 1 and support a much longer context length of 4K tokens.
The model has been built for both text and code generation and trained on a massive corpus of text from a wide range of sources and domains. Llama 2 can perform tasks such as translation, summarization, question answering, and text generation in a zero-shot or few-shot manner, meaning it does not require fine-tuning or additional data for each task.
Open Source Vector Databases
A vector database is a type of database that stores data as high-dimensional vectors. These vectors are mathematical representations of some attribute or feature of the data; for example, a vector can represent human speech, textual information, and so on. Information is retrieved from a vector database through similarity search.

Vector databases have become increasingly powerful and have emerged as a database category of their own with the growth of generative AI. Developers now have a range of choices: ChromaDB, Weaviate, Milvus, PGVector, Qdrant, Pinecone, and others.

In this tutorial, we will use Qdrant: it is open source and performs strongly for semantics-based search, recommendation systems, and similar workloads. Qdrant supports high-dimensional data in the form of images, text, audio, or video, and provides production-ready vector database services with APIs to store, search, and manage data. It supports complex filtering, ranking, and aggregation operations on vector data while being distributed and scalable.
Explaining the RAG Pipeline
Retrieval Augmented Generation (RAG) is a technique used with Large Language Models to increase the amount of contextual data the model can draw on when generating its responses. Ordinarily, an LLM answers a query using only what it learned from its training data, which brings various limitations.

For example, ChatGPT at launch was famously limited to information available up to its 2021 training cutoff. Such limitations can pose major constraints on how these models can be used. The missing information could, in principle, be supplied by giving the model the entire relevant context in a single prompt.

For example, if an LLM is required to answer a question about a recent book, the entire content of the book could be passed in alongside the question as the prompt. This, however, would require a massive context window, which most models do not have. This is where a RAG pipeline helps.
The RAG technique supplements LLMs in real time with data retrieved from vector databases. This retrieval can be further enhanced with best-in-class retrieval techniques so that the most relevant contextual information for a given prompt is supplied to the model when it generates a response.

In other words, in the context of our book example, the vector database fetches the most contextually relevant passages from the book, rather than the entire book, and feeds them to the model along with the question as the prompt to generate the required response.
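Conceptually, the flow looks something like the sketch below. This is illustrative pseudocode only; `vector_db` and `llm` stand in for the Qdrant store and Llama 2 model that the rest of this article sets up.

```python
# Illustrative RAG flow; `vector_db` and `llm` are placeholders built later in the article.
def answer_with_rag(question: str, vector_db, llm, k: int = 4) -> str:
    # 1. Retrieve the k chunks most similar to the question.
    relevant_chunks = vector_db.similarity_search(question, k=k)
    # 2. Pack only those chunks (not the whole document) into the prompt.
    context = "\n\n".join(chunk.page_content for chunk in relevant_chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    # 3. Let the LLM generate an answer grounded in the retrieved context.
    return llm(prompt)
```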
What Is Streamlit?
Streamlit is an open-source Python library for building interactive web apps for machine-learning applications such as data exploration and data visualization. Data can be explored and visualized as shareable web apps with very little code. Streamlit is fast and simple: it lets users build dashboards, reports, presentations, and prototypes, and keep them up to date with quick deployment and updates, without spending much time on web development. It supports a wide range of libraries including pandas, Matplotlib, scikit-learn, PyTorch, and TensorFlow.
Configuring Node on E2E
To build our pipeline on a locally running Llama 2 chat model (the 7B version), I used the GDC.V100-8.120GB plan, which provides a V100 GPU and 8 CPUs paired with 120 GB of memory, priced at 100 INR per hour. This was paired with 50 GB of disk space, giving me enough headroom to operate. This is not a bare-minimum requirement; I chose it to leave some extra headroom for experimenting while setting up the pipeline.
Once you launch the node on E2E Cloud, don't forget to add your SSH keys so that you can follow the steps below.
Also, we recommend using VSCode’s Remote Explorer extension to turn the E2E Cloud’s GPU node into a local development environment.
Building the RAG Pipeline
Before building, save a set of PDFs in a directory; in our case, that is the folder ‘PDFs/’. All queries will be based on these documents. Now let us look at how the pipeline can be built using Llama 2 and the vector DB.
1. Installing Packages
Note: Protobuf 3.20.* has to be installed initially for setting up the Llama 2 chat model locally.
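The exact install commands are not reproduced here; a reasonable starting set, assuming a fresh Python environment, is sketched below (the package list and order are assumptions based on the steps that follow).

```python
import subprocess
import sys

# Illustrative package set for this pipeline; versions other than Protobuf are left unpinned.
# Protobuf 3.20.* goes in first, per the note above. qdrant-client is installed later, in step 6.
packages = [
    "protobuf==3.20.*",
    "torch",
    "transformers",
    "accelerate",
    "langchain",
    "sentence-transformers",
    "pypdf",
    "streamlit",
]
for pkg in packages:
    subprocess.check_call([sys.executable, "-m", "pip", "install", pkg])
```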
2. Obtaining the Llama 2 Chat Model (7B) and the Tokenizer
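A minimal sketch of loading the model and tokenizer with Hugging Face Transformers is shown below. It assumes you have been granted access to the gated `meta-llama/Llama-2-7b-chat-hf` checkpoint and are logged in via `huggingface-cli login`.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Gated checkpoint: requires approved access on Hugging Face and an authenticated session.
model_id = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # half precision to fit comfortably on a single V100
    device_map="auto",           # place the weights on the available GPU automatically
)
```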
3. Setting Up a Query Pipeline
The query pipeline will be responsible for querying the Llama 2 model to obtain the response.
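One way to set this up is with the `transformers` `pipeline` API, as sketched below; the generation parameters are reasonable defaults rather than values prescribed by the article.

```python
from transformers import pipeline

# Text-generation pipeline around the model and tokenizer loaded in the previous step.
query_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,      # cap the length of generated answers
    do_sample=True,
    temperature=0.1,         # keep answers close to the retrieved context
    repetition_penalty=1.1,  # discourage the model from looping
)
```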
4. Converting the Pipeline to an Instance of HuggingFacePipeline
HuggingFacePipeline is part of the LangChain library, which provides an interface to Hugging Face's Transformers library. It will be used to set up the RAG pipeline.
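Wrapping the pipeline so LangChain can treat it as an LLM might look like the snippet below; the import path reflects LangChain versions from around the time of writing.

```python
from langchain.llms import HuggingFacePipeline

# Wrap the transformers pipeline so it can be used anywhere LangChain expects an LLM.
llm = HuggingFacePipeline(pipeline=query_pipeline)
```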
5. Setting Up Embeddings
Next, we generate embeddings for the text data using a pre-trained model from Hugging Face's Sentence Transformers, with the computation performed on a GPU if one is available.
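A sketch using LangChain's `HuggingFaceEmbeddings` wrapper is shown below; the particular Sentence Transformers checkpoint is an assumption, and any embedding model you prefer can be substituted.

```python
from langchain.embeddings import HuggingFaceEmbeddings

# The model name is illustrative; any Sentence Transformers checkpoint will work.
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cuda"},   # switch to "cpu" if no GPU is available
)
```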
6. Managing Protobuf Installs
Existing installations of Protobuf are removed first. Installing qdrant-client then installs the latest version of Protobuf, which has to be downgraded again later.
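One way to script this step, mirroring the description above:

```python
import subprocess
import sys

# Remove the existing Protobuf, then install qdrant-client (which pulls in the latest Protobuf).
# The downgrade back to 3.20.* happens in step 8.
subprocess.check_call([sys.executable, "-m", "pip", "uninstall", "-y", "protobuf"])
subprocess.check_call([sys.executable, "-m", "pip", "install", "qdrant-client"])
```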
7. Setting Up the Pipeline to Read PDF Docs and Break Them into Chunks
All files in the PDF directory are read and their text is extracted. The text is then broken into chunks, which will be vectorized later.
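A sketch using LangChain's PDF directory loader and a recursive character splitter; the chunk size and overlap below are illustrative choices, not values fixed by the article.

```python
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load every PDF in the directory mentioned at the start of this section.
loader = PyPDFDirectoryLoader("PDFs/")
documents = loader.load()

# Split the extracted text into overlapping chunks for vectorization.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = text_splitter.split_documents(documents)
```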
8. Installing Protobuf Again
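Protobuf is downgraded back to 3.20.* so the locally loaded Llama 2 model keeps working, for example:

```python
import subprocess
import sys

# Downgrade Protobuf back to the version the local Llama 2 setup expects.
subprocess.check_call([sys.executable, "-m", "pip", "install", "protobuf==3.20.*"])
```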
9. Initializing the Vector Database
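A sketch using LangChain's Qdrant integration with an in-memory Qdrant instance; the collection name is an assumption, and you can point at a running Qdrant server or an on-disk path instead.

```python
from langchain.vectorstores import Qdrant

# Build a Qdrant collection from the chunks and embeddings created in the earlier steps.
vector_db = Qdrant.from_documents(
    chunks,
    embeddings,
    location=":memory:",            # use url="http://<host>:6333" or path="./qdrant_db" for persistence
    collection_name="document_qa",  # hypothetical collection name
)
```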
10. Querying the Pipeline Based on Data in the Vector DB
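Tying everything together with LangChain's `RetrievalQA` chain might look like the sketch below; the chain type, number of retrieved chunks, and sample query are assumptions.

```python
from langchain.chains import RetrievalQA

# Retrieval-augmented QA chain: fetch relevant chunks from Qdrant, then let Llama 2 answer.
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",   # place the retrieved chunks directly into the prompt
    retriever=vector_db.as_retriever(search_kwargs={"k": 4}),
)

query = "Summarize the key points across the uploaded documents."
print(qa.run(query))
```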
Putting all of the above steps together gives the complete pipeline code.
Moving the Entire Pipeline to Streamlit
The pipeline built above can now be moved into a Streamlit app.
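Below is a condensed sketch of one way to wrap the steps above in a Streamlit app; the model names, chunking parameters, and UI layout are assumptions carried over from the earlier snippets.

```python
import streamlit as st
import torch
from langchain.chains import RetrievalQA
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import HuggingFacePipeline
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Qdrant
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline


@st.cache_resource
def load_qa_chain():
    """Rebuild the RAG pipeline from the earlier steps; cached so it runs once per session."""
    model_id = "meta-llama/Llama-2-7b-chat-hf"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )
    llm = HuggingFacePipeline(pipeline=pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.1,
    ))

    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2",
        model_kwargs={"device": "cuda"},
    )
    documents = PyPDFDirectoryLoader("PDFs/").load()
    chunks = RecursiveCharacterTextSplitter(
        chunk_size=1000, chunk_overlap=100
    ).split_documents(documents)
    vector_db = Qdrant.from_documents(
        chunks, embeddings, location=":memory:", collection_name="document_qa"
    )
    return RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=vector_db.as_retriever(search_kwargs={"k": 4}),
    )


st.title("Multi-Document Q&A with Llama 2 and Qdrant")

question = st.text_input("Ask a question about the documents in the PDFs/ folder:")
if question:
    with st.spinner("Retrieving context and generating an answer..."):
        answer = load_qa_chain().run(question)
    st.write(answer)
```

Run it with `streamlit run app.py` (the filename is arbitrary); Streamlit re-executes the script on each interaction, while `st.cache_resource` keeps the heavy model and vector store loaded across reruns.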
Conclusion
In this article, we have shown how to use Llama 2, Streamlit, and Qdrant Vector DB to build a RAG pipeline for a multi-document Q&A chatbot. We have explained the main features and benefits of these open-source tools, and how they can facilitate document Q&A tasks. We have also demonstrated the steps to set up the environment, split and ingest documents, and query them using the RAG pipeline on Llama 2. We have presented the output of the chatbot, which can answer complex questions by retrieving relevant passages from multiple documents. We hope this article has inspired you to try out this approach and explore its potential for your own use cases.