LLMs are in the spotlight now. They are a vast source of knowledge and have changed the face of how search engines work. Information retrieval and search have become a lot easier since the debut of chatbots like ChatGPT. While the knowledge of current AI models, including ChatGPT, is confined to information available up to 2021, Bing has adopted a different approach: it augments its responses with up-to-date information extracted from the internet, offering a more current and comprehensive knowledge base.
Retrieval Augmented Generation (RAG)
But there is another approach that augments the knowledge of LLMs by retrieving information from custom content: Retrieval Augmented Generation (RAG). Utility tools like ChatPDF have become popular examples of this pattern: a PDF document is connected as an external data source, and we interact with it with the assistance of an LLM. In RAG, we insert additional data into the context (prompt) of a model at inference time, which helps the LLM produce more precise and relevant responses to our queries than zero-shot prompting. Another way of looking at it is the relationship between a doctor and a patient: a doctor’s diagnosis can be significantly more precise and accurate when they have access to the patient’s test results and charts, as opposed to relying solely on symptomatic observations.
Here’s a quick step-by-step guide to building a RAG based LLM application.
The System Workflow
The workflow of the RAG based LLM application will be as follows:
- Receive query from the user.
- Convert it to an embedded query vector preserving the semantics, using an embedding model.
- Retrieve the top-k most relevant chunks from the vector database by computing the similarity between the query embedding and the content embeddings stored in the database (see the sketch after this list).
- Pass the retrieved content and query as a prompt to an LLM.
- The LLM gives the required response.
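Conceptually, steps 2 and 3 boil down to embedding the query and ranking stored chunks by similarity. Here is a minimal, self-contained sketch that uses sentence-transformers directly, with toy chunks and a hypothetical query; the rest of this tutorial does the same thing through a vector database and LangChain.

```python
# Minimal sketch of steps 2-3: embed the query and rank stored chunks by cosine similarity.
# The chunks and the query below are toy examples; a real app stores the chunk embeddings
# in a vector database instead of an in-memory list.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Hemoglobin: 13.5 g/dL (reference range 13.0-17.0)",
    "The patient reports mild fatigue and occasional headaches.",
    "Fasting blood glucose: 92 mg/dL",
]

chunk_vecs = model.encode(chunks, normalize_embeddings=True)                  # content embeddings
query_vec = model.encode("What is the hemoglobin level?", normalize_embeddings=True)

scores = np.dot(chunk_vecs, query_vec)      # cosine similarity (vectors are normalized)
top_k = np.argsort(scores)[::-1][:2]        # indices of the top-k most similar chunks
retrieved = [chunks[i] for i in top_k]      # this context is passed to the LLM along with the query
```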
Prerequisites
The directory structure for the project is shown below.
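The exact layout can vary; an indicative structure, based on the files referenced throughout this tutorial, looks like this:

```
project/
├── docs/                 # source PDF(s) to index
├── db/                   # Chroma persistence directory (created by ingest.py)
├── LaMini-T5-738M/       # model checkpoint cloned from Hugging Face
├── constants.py
├── ingest.py
├── app.py
└── requirements.txt
```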
Ensure that you are using Python version 3.9.0 or later. Install the following Python libraries by preparing a requirements.txt file.
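A minimal requirements.txt for this stack might look like the following (versions unpinned; adjust as needed):

```
langchain
chromadb
sentence-transformers
transformers
torch
accelerate
pdfminer.six
```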
We’ve used a large language model that runs reasonably well on a CPU with a minimum of 8 GB of RAM, though better specifications are recommended. If you’re considering other large language models, a cloud-based environment like E2E Cloud might be necessary.
Clone the model repository from Hugging Face to the working directory.
Make sure you have git installed on your system.
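For example, assuming the MBZUAI/LaMini-T5-738M repository on Hugging Face (git-lfs is needed to pull the model weights):

```
git lfs install
git clone https://huggingface.co/MBZUAI/LaMini-T5-738M
```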
Configuring the Database
constants.py
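A sketch of a minimal constants.py, assuming an older Chroma release that uses the duckdb+parquet backend:

```python
# constants.py — a minimal sketch; assumes an older Chroma release with the duckdb+parquet backend.
from chromadb.config import Settings

# Directory where the Chroma index is persisted
PERSIST_DIRECTORY = "db"

CHROMA_SETTINGS = Settings(
    chroma_db_impl="duckdb+parquet",      # storage backend (parquet files on disk)
    persist_directory=PERSIST_DIRECTORY,  # where the index is written
    anonymized_telemetry=False,           # disable usage telemetry
)
```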
Loading the Data
The retrieval knowledge base must be constructed before building the application. For this we use a vector database. In order to retrieve specific information from a document, such as a patient’s lab report, we first need to process the content of the document. This involves converting the raw data into a format that can be understood and manipulated by our system.
Once the data is processed, it is then stored in a database. However, instead of storing the data in its original form, we convert it into a mathematical representation known as an embedding. These embeddings capture the semantic meaning of the data and allow us to perform complex operations on it. For example, if we want to query information from a patient’s lab report, we search for the embedding of its contents in our vector database.
We are using Chroma DB here for simplicity. Chroma DB is an open-source, simple yet feature-rich vector database for building AI applications. Check out the documentation for details.
Import libraries and load contents into the vector database.
ingest.py
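A sketch of the imports, assuming the classic single-package langchain layout (before the langchain-community split):

```python
import os

from langchain.document_loaders import PDFMinerLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.vectorstores import Chroma

from constants import CHROMA_SETTINGS  # settings defined in constants.py above
```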
Walk through the docs directory and create a PDFMinerLoader object for each PDF file. After the loop finishes, the data from the last processed PDF file is loaded into the documents variable (the app in this tutorial works with a single PDF).
ingest.py
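A sketch of the loading step, assuming the PDF files live in a docs directory:

```python
# Walk the docs directory and create a loader for each PDF found.
# Note: documents ends up holding only the last PDF processed, which is fine
# for the single-PDF setup used in this tutorial.
for root, dirs, files in os.walk("docs"):
    for file in files:
        if file.endswith(".pdf"):
            loader = PDFMinerLoader(os.path.join(root, file))

documents = loader.load()  # list of Document objects containing the PDF's text
```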
The document is segmented into multiple parts to simplify the search process and aid efficient retrieval of the most relevant content. We use RecursiveCharacterTextSplitter from LangChain to split the document into chunks of 500 characters with an overlap of 500 characters between consecutive chunks.
Then, the SentenceTransformerEmbeddings wrapper with the "all-MiniLM-L6-v2" model is used to generate embeddings (numerical representations) for each chunk of text.
ingest.py
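A sketch of the splitting and embedding step (the overlap value follows the text above, though an overlap smaller than the chunk size, e.g. 50, is more typical):

```python
# Split the loaded document into overlapping character chunks.
# Note: an overlap smaller than the chunk size (e.g. 50) is more common;
# the values below follow the description in the text.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=500)
texts = text_splitter.split_documents(documents)

# Embedding model used to turn each chunk into a vector
embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
```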
Create a Chroma object from the text chunks and the embedding model. The data is then persisted (saved) as parquet files in a directory named "db" for future use.
ingest.py
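A sketch of the persistence step, reusing CHROMA_SETTINGS from constants.py:

```python
# Build the vector store from the chunks and persist it to the db directory
db = Chroma.from_documents(
    texts,
    embeddings,
    persist_directory="db",
    client_settings=CHROMA_SETTINGS,
)
db.persist()
db = None  # release the handle so the data is flushed to disk
```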
Creating LLM Object
Here we are using an open-source, lightweight LLM called LaMini-T5-738M. Load the tokenizer and the model from the pretrained checkpoint, using the AutoModelForSeq2SeqLM class for the seq2seq (encoder-decoder) model, which has a language modeling (LM) head on top.
app.py
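A sketch of the model loading step, assuming the checkpoint was cloned into the working directory (device_map="auto" requires the accelerate package):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Path to the checkpoint cloned earlier (or the Hugging Face model id)
checkpoint = "LaMini-T5-738M"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
base_model = AutoModelForSeq2SeqLM.from_pretrained(
    checkpoint,
    device_map="auto",          # place the model automatically (CPU in this setup)
    torch_dtype=torch.float32,  # full precision; the model is small enough for CPU
)
```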
Create a pipeline for text-to-text generation using the specified model, tokenizer, and several parameters that control the text generation process. Adjust the temperature parameter to control the randomness of the output. Lower values make the output more deterministic. max_length sets the maximum length of the generated text.
app.py
```python
from langchain.llms import HuggingFacePipeline
```
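Building on the import above and the model and tokenizer loaded earlier, a sketch of the pipeline setup (the generation parameter values are indicative):

```python
from transformers import pipeline

# Text-to-text generation pipeline built on the LaMini-T5 model and tokenizer
pipe = pipeline(
    "text2text-generation",
    model=base_model,
    tokenizer=tokenizer,
    max_length=256,      # maximum length of the generated answer
    do_sample=True,
    temperature=0.3,     # lower temperature -> more deterministic output
    top_p=0.95,
)

# Wrap the pipeline so LangChain can use it as an LLM
llm = HuggingFacePipeline(pipeline=pipe)
```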
Configuring the Chain
Set up a question-answering pipeline using the language model and a retriever. db.as_retriever() creates a retriever from the Chroma database; the retriever is responsible for fetching the documents relevant to a query. In LangChain, a chain is a wrapper that composes multiple components, where each step can be a call to the Large Language Model (LLM) or to another data source such as the retriever. The chain_type argument specifies how the retrieved documents are combined and passed to the LLM; for example, the "stuff" type simply inserts all retrieved chunks into the prompt alongside the question.
app.py
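A sketch of the chain setup, assuming the "stuff" chain type and the same embedding model used at ingest time:

```python
from langchain.chains import RetrievalQA
from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.vectorstores import Chroma

from constants import CHROMA_SETTINGS

# Re-open the persisted Chroma index with the same embedding model used during ingestion
embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
db = Chroma(
    persist_directory="db",
    embedding_function=embeddings,
    client_settings=CHROMA_SETTINGS,
)

retriever = db.as_retriever()  # fetches the most relevant chunks for a query

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",            # stuff all retrieved chunks into the prompt
    retriever=retriever,
    return_source_documents=True,  # also return the chunks the answer was based on
)
```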
Now pass the query and generate a response from the LLM.
app.py
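A sketch of the final step (the query shown is hypothetical):

```python
query = "What is the patient's hemoglobin level?"  # hypothetical example query

result = qa(query)                     # runs retrieval + generation
answer = result["result"]              # the LLM's answer
sources = result["source_documents"]   # chunks the answer was grounded on

print(answer)
```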
Add the PDF file you want to explore to the docs directory, run ingest.py to build the database of embeddings, and then run app.py to see the result.
Sample query:
Wrapping Up
And that was a simple RAG-based LLM application. You have learnt how to use an LLM with RAG to generate relevant and informative answers from large-scale text corpora. Try experimenting with the various chains in LangChain and building multi-PDF readers. We hope you enjoyed this tutorial and found it useful for your projects.
References
https://github.com/AIAnytime/Search-Your-PDF-App