Open Source AI and Document Q&A
Document Q&A is an important class of problems in NLP. Documents are one of the most common ways of storing information, so the collective value of the data they hold is immense. Recent advances in open-source AI and LLMs show what these models are capable of when given the right information in sufficient quantity. Harnessing the information locked inside documents can multiply the benefits of open-source AI.
The Promise of E2E Cloud
To build this multi-document Q&A chatbot, we will leverage E2E Cloud, which offers instant access to advanced cloud GPUs like the A100, and the newly launched H100, at highly cost-effective prices.
To explore, sign up at E2E Cloud, and then head to the ‘Compute’ tab. There, you will find a range of advanced GPUs on offer.
The Llama 2 LLM Model
Llama 2 is an open-source model family developed by Meta that succeeds its Llama 1 LLM. The family consists of models with 7 billion, 13 billion, and 70 billion parameters. The Llama 2 models are trained on 40% more tokens than Llama 1 and support a much longer context length of 4K tokens.
The model has been built for both text and code generation and trained on a massive corpus of text from a wide range of sources and domains. Llama 2 can perform tasks such as translation, summarization, question answering, and text generation in a zero-shot or few-shot manner, meaning it does not require fine-tuning or additional data for each task.
Open Source Vector Databases
A vector database is a type of database that stores data as high-dimensional vectors. These vectors are mathematical representations of some attribute or feature of the data; for example, a vector can represent human speech, textual information, and so on. Information is retrieved from a vector database through similarity search.

Vector databases have become increasingly powerful and have emerged as a database category of their own with the growth of generative AI. Developers now have a range of choices: ChromaDB, Weaviate, Milvus, PGVector, Qdrant, Pinecone, and others.

In this tutorial, we will use Qdrant: it is open source and performs strongly for semantics-based search, recommendation systems, and similar workloads. Qdrant supports high-dimensional data in the form of images, text, audio, or video, and provides production-ready vector database services with APIs to store, search, and manage data. It supports complex filtering, ranking, and aggregation operations on vector data while being distributed and scalable.
Explaining the RAG Pipeline
Retrieval Augmented Generation (RAG) is a technique used with Large Language Models to increase the amount of contextual data the model can draw on when generating its responses. Ordinarily, an LLM answers a query using only what it learned from its training data, which brings various limitations.

For example, ChatGPT at launch was famously limited to information available up to its 2021 training cutoff. Such limitations can pose major constraints on how these models can be used. The missing information could, in principle, be supplied by giving the model the entire relevant context in a single prompt.

For example, if an LLM is required to answer a question about a recent book, the entire content of the book could be passed in alongside the question as the prompt. This, however, would require a massive context window, which most models do not have. This is where a RAG pipeline helps.
The RAG technique supplements LLMs in real time with data retrieved from vector databases. This retrieval can be further enhanced with best-in-class retrieval techniques so that the most relevant contextual information for a given prompt is supplied to the model when it generates a response.

In other words, in the context of our book example, the vector database fetches the most contextually relevant passages from the book, rather than the entire book, and feeds them to the model along with the question as the prompt to generate the required response.
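Conceptually, the flow looks something like the sketch below. This is illustrative pseudocode only; `vector_db` and `llm` stand in for the Qdrant store and Llama 2 model that the rest of this article sets up.

```python
# Illustrative RAG flow; `vector_db` and `llm` are placeholders built later in the article.
def answer_with_rag(question: str, vector_db, llm, k: int = 4) -> str:
    # 1. Retrieve the k chunks most similar to the question.
    relevant_chunks = vector_db.similarity_search(question, k=k)
    # 2. Pack only those chunks (not the whole document) into the prompt.
    context = "\n\n".join(chunk.page_content for chunk in relevant_chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    # 3. Let the LLM generate an answer grounded in the retrieved context.
    return llm(prompt)
```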
What Is Streamlit?
Streamlit is an open-source Python library for building interactive web apps for machine-learning applications such as data exploration and data visualization. Data can be explored and visualized as shareable web apps with very little code. Streamlit is fast and simple: it lets users build dashboards, reports, presentations, and prototypes, and keep them up to date with quick deployment and updates, without spending much time on web development. It supports a wide range of libraries including pandas, Matplotlib, scikit-learn, PyTorch, and TensorFlow.
Configuring Node on E2E
To build our pipeline on a locally running Llama 2 chat model (the 7B version), I used the GDC.V100-8.120GB plan, which provides a V100 GPU and 8 CPUs paired with 120 GB of memory, priced at 100 INR per hour. This was paired with 50 GB of disk space, giving me enough headroom to operate. This is not a bare-minimum requirement; I chose it to leave some extra headroom for experimenting while setting up the pipeline.
Once you launch the node on E2E Cloud, don't forget to add your SSH keys so that you can follow the steps below.
Also, we recommend using VSCode’s Remote Explorer extension to turn the E2E Cloud’s GPU node into a local development environment.
Building the RAG Pipeline
Before building, save a set of PDFs in a directory; in our case, that is the folder ‘PDFs/’. All queries will be based on these documents. Now let us look at how the pipeline can be built using Llama 2 and the vector DB.
1. Installing Packages
Note: Protobuf 3.20.* has to be installed initially for setting up the Llama 2 chat model locally.
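The exact install commands are not reproduced here; a reasonable starting set, assuming a fresh Python environment, is sketched below (the package list and order are assumptions based on the steps that follow).

```python
import subprocess
import sys

# Illustrative package set for this pipeline; versions other than Protobuf are left unpinned.
# Protobuf 3.20.* goes in first, per the note above. qdrant-client is installed later, in step 6.
packages = [
    "protobuf==3.20.*",
    "torch",
    "transformers",
    "accelerate",
    "langchain",
    "sentence-transformers",
    "pypdf",
    "streamlit",
]
for pkg in packages:
    subprocess.check_call([sys.executable, "-m", "pip", "install", pkg])
```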
2. Obtaining the Llama 2 Chat Model (7B) and the Tokenizer
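A minimal sketch of loading the model and tokenizer with Hugging Face Transformers is shown below. It assumes you have been granted access to the gated `meta-llama/Llama-2-7b-chat-hf` checkpoint and are logged in via `huggingface-cli login`.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Gated checkpoint: requires approved access on Hugging Face and an authenticated session.
model_id = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # half precision to fit comfortably on a single V100
    device_map="auto",           # place the weights on the available GPU automatically
)
```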
3. Setting Up a Query Pipeline
The query pipeline will be responsible for querying the Llama 2 model to obtain the response.
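One way to set this up is with the `transformers` `pipeline` API, as sketched below; the generation parameters are reasonable defaults rather than values prescribed by the article.

```python
from transformers import pipeline

# Text-generation pipeline around the model and tokenizer loaded in the previous step.
query_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,      # cap the length of generated answers
    do_sample=True,
    temperature=0.1,         # keep answers close to the retrieved context
    repetition_penalty=1.1,  # discourage the model from looping
)
```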
4. Converting the Pipeline to an Instance of HuggingFacePipeline
HuggingFacePipeline is part of the LangChain library, which provides an interface to Hugging Face's Transformers library. It will be used to set up the RAG pipeline.
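Wrapping the pipeline so LangChain can treat it as an LLM might look like the snippet below; the import path reflects LangChain versions from around the time of writing.

```python
from langchain.llms import HuggingFacePipeline

# Wrap the transformers pipeline so it can be used anywhere LangChain expects an LLM.
llm = HuggingFacePipeline(pipeline=query_pipeline)
```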
5. Setting Up Embeddings
Next, we generate embeddings for the text data using a pre-trained model from Hugging Face's Sentence Transformers, with the computation performed on a GPU if one is available.
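A sketch using LangChain's `HuggingFaceEmbeddings` wrapper is shown below; the particular Sentence Transformers checkpoint is an assumption, and any embedding model you prefer can be substituted.

```python
from langchain.embeddings import HuggingFaceEmbeddings

# The model name is illustrative; any Sentence Transformers checkpoint will work.
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cuda"},   # switch to "cpu" if no GPU is available
)
```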
6. Managing Protobuf Installs
Existing installations of Protobuf are removed first. Installing qdrant-client then installs the latest version of Protobuf, which has to be downgraded again later.
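One way to script this step, mirroring the description above:

```python
import subprocess
import sys

# Remove the existing Protobuf, then install qdrant-client (which pulls in the latest Protobuf).
# The downgrade back to 3.20.* happens in step 8.
subprocess.check_call([sys.executable, "-m", "pip", "uninstall", "-y", "protobuf"])
subprocess.check_call([sys.executable, "-m", "pip", "install", "qdrant-client"])
```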
7. Setting Up the Pipeline to Read PDF Docs and Break Them into Chunks
All files in the PDF directory are read and their text is extracted. The text is then broken into chunks, which will be vectorized later.
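A sketch using LangChain's PDF directory loader and a recursive character splitter; the chunk size and overlap below are illustrative choices, not values fixed by the article.

```python
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load every PDF in the directory mentioned at the start of this section.
loader = PyPDFDirectoryLoader("PDFs/")
documents = loader.load()

# Split the extracted text into overlapping chunks for vectorization.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = text_splitter.split_documents(documents)
```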
8. Installing Protobuf Again
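Protobuf is downgraded back to 3.20.* so the locally loaded Llama 2 model keeps working, for example:

```python
import subprocess
import sys

# Downgrade Protobuf back to the version the local Llama 2 setup expects.
subprocess.check_call([sys.executable, "-m", "pip", "install", "protobuf==3.20.*"])
```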
9. Initializing the Vector Database
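A sketch using LangChain's Qdrant integration with an in-memory Qdrant instance; the collection name is an assumption, and you can point at a running Qdrant server or an on-disk path instead.

```python
from langchain.vectorstores import Qdrant

# Build a Qdrant collection from the chunks and embeddings created in the earlier steps.
vector_db = Qdrant.from_documents(
    chunks,
    embeddings,
    location=":memory:",            # use url="http://<host>:6333" or path="./qdrant_db" for persistence
    collection_name="document_qa",  # hypothetical collection name
)
```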
10. Querying the Pipeline Based on Data in the Vector DB
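Tying everything together with LangChain's `RetrievalQA` chain might look like the sketch below; the chain type, number of retrieved chunks, and sample query are assumptions.

```python
from langchain.chains import RetrievalQA

# Retrieval-augmented QA chain: fetch relevant chunks from Qdrant, then let Llama 2 answer.
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",   # place the retrieved chunks directly into the prompt
    retriever=vector_db.as_retriever(search_kwargs={"k": 4}),
)

query = "Summarize the key points across the uploaded documents."
print(qa.run(query))
```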
Putting all of the above steps together gives the complete pipeline code.
Moving the Entire Pipeline to Streamlit
The pipeline built above can now be moved into a Streamlit app.
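Below is a condensed sketch of one way to wrap the steps above in a Streamlit app; the model names, chunking parameters, and UI layout are assumptions carried over from the earlier snippets.

```python
import streamlit as st
import torch
from langchain.chains import RetrievalQA
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import HuggingFacePipeline
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Qdrant
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline


@st.cache_resource
def load_qa_chain():
    """Rebuild the RAG pipeline from the earlier steps; cached so it runs once per session."""
    model_id = "meta-llama/Llama-2-7b-chat-hf"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )
    llm = HuggingFacePipeline(pipeline=pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.1,
    ))

    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2",
        model_kwargs={"device": "cuda"},
    )
    documents = PyPDFDirectoryLoader("PDFs/").load()
    chunks = RecursiveCharacterTextSplitter(
        chunk_size=1000, chunk_overlap=100
    ).split_documents(documents)
    vector_db = Qdrant.from_documents(
        chunks, embeddings, location=":memory:", collection_name="document_qa"
    )
    return RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=vector_db.as_retriever(search_kwargs={"k": 4}),
    )


st.title("Multi-Document Q&A with Llama 2 and Qdrant")

question = st.text_input("Ask a question about the documents in the PDFs/ folder:")
if question:
    with st.spinner("Retrieving context and generating an answer..."):
        answer = load_qa_chain().run(question)
    st.write(answer)
```

Run it with `streamlit run app.py` (the filename is arbitrary); Streamlit re-executes the script on each interaction, while `st.cache_resource` keeps the heavy model and vector store loaded across reruns.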
Conclusion
In this article, we have shown how to use Llama 2, Streamlit, and Qdrant Vector DB to build a RAG pipeline for a multi-document Q&A chatbot. We have explained the main features and benefits of these open-source tools, and how they can facilitate document Q&A tasks. We have also demonstrated the steps to set up the environment, split and ingest documents, and query them using the RAG pipeline on Llama 2. We have presented the output of the chatbot, which can answer complex questions by retrieving relevant passages from multiple documents. We hope this article has inspired you to try out this approach and explore its potential for your own use cases.