Introduction
Let’s start with a brief introduction to Mistral 7B, Haystack, and Weaviate:
Mistral 7B
Mistral 7B is an advanced Large Language Model (LLM) developed by Mistral AI. It's part of a new generation of AI models that have revolutionized natural language processing (NLP) and understanding. Mistral 7B is known for its impressive performance, particularly in tasks involving code, mathematics, and reasoning. The model stands out for its efficiency and the ability to deliver high-quality results in various applications, including chatbots, content generation, and code completion. Its architecture includes innovative features like Sliding Window Attention and Grouped-query Attention, contributing to its speed and accuracy. The model's adaptability and versatility across different sectors have made it a popular choice in the AI community.
Haystack
Haystack, developed by Deepset, is an open-source framework for building search systems that go beyond traditional keyword-based search engines. It's designed to leverage the latest NLP technologies for more intelligent and context-aware search. Haystack allows users to build pipelines that can handle tasks like semantic search, question answering, and summarization. The framework supports various state-of-the-art NLP models and can integrate with different document stores and databases. It's particularly notable for its ability to create systems that combine retrieval methods (fetching relevant documents) with generative AI models (to synthesize answers or insights from those documents).
Weaviate
Weaviate is an open-source vector database designed for scalable and efficient similarity search. It's built to handle large amounts of data and provides real-time vector search capabilities. Weaviate is unique in its use of vector indexing to facilitate searches based on the semantic similarity of data points, making it ideal for use cases in AI and machine learning where traditional databases fall short. The database is particularly effective in applications that require searching through large sets of unstructured data, like text or images, and finding items that are semantically similar to a query. Weaviate's architecture supports various machine learning models for generating and searching through vectors, making it a versatile tool for developers working with AI and NLP applications.
In summary, Mistral 7B is a powerful language model known for its efficiency and versatility in NLP tasks, Haystack is a flexible framework for building advanced search and NLP systems, and Weaviate is a vector database optimized for semantic similarity search, making them valuable tools in the realm of AI and machine learning.
What Is Retrieval Augmented Generation (RAG)?
Retrieval Augmented Generation (RAG) is a recent breakthrough in generative AI technology that enables an LLM to generate more accurate and verifiable responses. The RAG framework works by supplementing the knowledge representation of an LLM with external sources which are specified through an input prompt. This enables the LLM to retrieve information from external, verifiable sources and to integrate the retrieved content with its generative capability to derive a customized response for a user query. RAG was first developed by Meta (https://arxiv.org/abs/2005.11401) to enhance the knowledge base available to an LLM beyond its training data.
This framework offers several advantages in improving the quality of LLM-generated responses. A major challenge with generative AI is that LLMs tend to hallucinate a response when they are unsure of an answer or when the query is ambiguous. The RAG framework reduces hallucination by grounding the LLM with additional information on specific topics. This is particularly helpful when LLM chatbots must answer queries about policies and enterprise-specific operations, which may otherwise not be covered by the generic data used to pre-train the LLMs. Since the external data sources are provided by the end user, they can be treated as trustworthy, and they also provide a reliable reference for verifying the quality of LLM outputs. Another advantage is that the RAG framework streamlines the task for which the LLM application is designed, improving the effectiveness and efficiency of its outputs by anchoring them to the external data source. The framework also enables scalable processing of large volumes of data, boosting efficiency in applications such as customer service, financial market analysis, and medical assistance.
Components of a RAG Framework
A RAG framework requires the following components for its implementation:
- External Data Sources: The external sources can be specified in a variety of formats such as PDF, JSON, etc., and the text data from these sources is parsed and converted to a numerical representation known as embeddings. The relevant embeddings are then retrieved and integrated with the internal representation of the LLM.
- Embedding Model: An embedding model is used to convert the text data to numerical embeddings, also known as vectors. Typically, sentence embedding models, such as those from the SentenceTransformers library, are used.
- Vector Document Store: The embeddings constructed from the external data sources are stored in a vector database to facilitate efficient lookup and similarity search. Several vector databases are available such as ChromaDB, Weaviate, Pinecone, Elasticsearch, etc. In particular, Weaviate is an open-source vector store that is used to demonstrate a RAG implementation in this article.
- Orchestration Pipeline: An end-to-end framework is required to extract data and embeddings, integrate seamlessly with the vector document store, apply semantic search on the embeddings, and generate responses for a scalable, production-ready RAG deployment. This framework is essentially a pipeline that preprocesses the external data, chunks it into appropriately sized units, converts the text to numerical embeddings, retrieves the relevant embeddings for each user query, and integrates the retrieved content with the generative logic of the LLM. Popular RAG implementation frameworks are LangChain and Haystack; Haystack has an intuitive API that is arguably more user-friendly than LangChain's.
- Large Language Model: An LLM is at the heart of the RAG framework, powering the generative AI mechanism. Any LLM, such as GPT, the Llama models, or Mistral, can be used. This article implements RAG using the open-source Mistral 7B model, which outperforms the Llama 2 13B model despite its significantly smaller parameter count. The smaller size also makes the Mistral model more efficient to run.
Step-by-Step Implementation of a RAG Framework
This section discusses the steps to implement a RAG framework using Mistral 7B as the LLM, Weaviate as the vector store, and Haystack as the pipeline. The utility of the RAG framework is demonstrated by gathering information about AI safety, using the AI Risk Management Framework released by NIST (https://www.nist.gov/itl/ai-risk-management-framework) as the external data source. The code is adapted from a publicly available GitHub repository (https://github.com/AIAnytime/Haystack-and-Mistral-7B-RAG-Implementation), modified to run the basic functionalities of an end-to-end RAG pipeline. Also, if you are looking for cloud GPUs, check out our offerings at E2E Networks. You can rent out a node for the latest NVIDIA V100 or A100 GPUs; there are many other options available as well.
The following directory structure is used, where the data/ folder stores the external data source, and the model/ folder contains the downloaded LLM model.
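The layout below is a reconstruction based on the folders and files named in this article; your notebook or script filename may differ:

```
.
├── data/                                      # external data source (NIST AI RMF PDF)
├── model/                                     # downloaded GGUF model file
│   └── mistral-7b-instruct-v0.1.Q4_K_S.gguf
├── model_add.py                               # custom LlamaCPPInvocationLayer implementation
└── requirements.txt
```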
Install the required Python packages to run Haystack and Weaviate. This is done using the following command in a Google Colab notebook:
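A typical install cell, assuming requirements.txt is in the notebook's working directory:

```
!pip install -r requirements.txt
```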
The requirements.txt file consists of the following packages to be installed:
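The exact package list depends on the repository you adapt; a representative set covering Haystack, the Weaviate client, the llama.cpp bindings, PDF parsing, and sentence embeddings would be:

```
farm-haystack
weaviate-client
llama-cpp-python
sentence-transformers
pypdf
```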
Download and save the external data source to the data/ folder (refer to the directory structure shown above).
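For example, the AI RMF 1.0 PDF can be fetched directly into the data/ folder (the URL below is the published NIST location at the time of writing; verify it on the NIST page linked above):

```
!wget -P data/ https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf
```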
Download and save the LLM model (in our case, Mistral 7B) to the model/ folder. Several quantized versions of Mistral are available on Hugging Face, in GGUF format.
We chose the version mistral-7b-instruct-v0.1.Q4_K_S.gguf (https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF).
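One way to fetch the file is via the huggingface_hub client (the repo ID below corresponds to TheBloke's quantized Instruct build; adjust it if you pick a different quantization):

```python
from huggingface_hub import hf_hub_download

# Download the 4-bit quantized GGUF file into the model/ folder
hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
    filename="mistral-7b-instruct-v0.1.Q4_K_S.gguf",
    local_dir="model",
)
```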
Import the required libraries.
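A minimal set of imports for the sketches in the remaining steps, assuming the farm-haystack 1.x API (PreProcessor, EmbeddingRetriever, PromptNode) and the custom model_add.py invocation layer from the adapted repository:

```python
from pypdf import PdfReader

import weaviate
from weaviate.embedded import EmbeddedOptions

from haystack import Document, Pipeline
from haystack.document_stores import WeaviateDocumentStore
from haystack.nodes import EmbeddingRetriever, PreProcessor, PromptNode, PromptTemplate

# Custom invocation layer that wires a local GGUF model (via llama-cpp-python)
# into Haystack's PromptNode; provided by the model_add.py file discussed below.
from model_add import LlamaCPPInvocationLayer
```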
Connect to a Weaviate database client from Colab, using the EmbeddedOptions() class.
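A minimal sketch of this step, assuming the weaviate-client v3 embedded mode and the Haystack 1.x WeaviateDocumentStore; the port and embedding dimension are choices made for this example:

```python
# Start an embedded Weaviate instance inside the Colab runtime (no separate server needed).
# The port is pinned explicitly so the Haystack document store below knows where to connect.
client = weaviate.Client(embedded_options=EmbeddedOptions(port=6666))

# Haystack document store backed by the embedded Weaviate instance.
# embedding_dim=384 matches the all-MiniLM-L6-v2 embedding model used in a later step.
document_store = WeaviateDocumentStore(port=6666, embedding_dim=384)
```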
Load external data sources, extract text, and convert them to a unified Document format. In this code, we use a PDF document as an external source. Hence, the PyPDFToDocument() converter is used. Other converters are available in the Haystack API to process formats such as HTML, JSON, etc.
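The article uses Haystack's PyPDFToDocument() converter for this step; as a minimal, version-agnostic sketch, the same result can be obtained by extracting the text with pypdf and wrapping it in a Haystack Document (the filename is the NIST PDF downloaded earlier, so adjust it to match yours):

```python
# Extract raw text from the PDF and wrap it in a single Haystack Document
reader = PdfReader("data/NIST.AI.100-1.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)
docs = [Document(content=text)]
```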
Apply the built-in PreProcessor() module to clean the text (removing extra whitespace, headers, and footers) and to chunk the data into smaller units, splitting by word count.
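A sketch of the preprocessing step using the Haystack 1.x PreProcessor; the split length and overlap values are illustrative choices:

```python
preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=True,
    split_by="word",          # chunk the text by word count
    split_length=500,         # words per chunk
    split_overlap=50,         # overlapping words between consecutive chunks
    split_respect_sentence_boundary=True,
)
preprocessed_docs = preprocessor.process(docs)
```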
Construct embeddings for the preprocessed text using the EmbeddingRetriever() module, and save them to the Weaviate document store. EmbeddingRetriever takes a model name as a parameter. We use sentence transformers in this code to generate the embeddings.
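A sketch of this step; the embedding model named here is an assumption (any SentenceTransformers model works, as long as its dimension matches the embedding_dim set on the document store):

```python
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",  # 384-dimensional embeddings
)

# Write the chunked documents to Weaviate, then compute and store their embeddings
document_store.write_documents(preprocessed_docs)
document_store.update_embeddings(retriever)
```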
Create a prompt template for the user query using the PromptNode() module. We provide the LLM model as input at this stage.
Note that the model uses an invocation layer named LlamaCPPInvocationLayer. The invocation layer class ties the external model to the Haystack pipeline; an implementation of this class is taken from the Python file model_add.py.
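A sketch of this step, assuming the Haystack 1.x PromptTemplate/PromptNode API and the LlamaCPPInvocationLayer class imported from model_add.py; the wording of the prompt template is illustrative:

```python
prompt_template = PromptTemplate(
    prompt="""Answer the question using only the provided context.
If the context does not contain the answer, say that you do not know.
Context: {join(documents)}
Question: {query}
Answer:"""
)

prompt_node = PromptNode(
    model_name_or_path="model/mistral-7b-instruct-v0.1.Q4_K_S.gguf",
    invocation_layer_class=LlamaCPPInvocationLayer,  # ties the local GGUF model to Haystack
    default_prompt_template=prompt_template,
    max_length=512,  # maximum number of tokens to generate
)
```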
As a final step, create a query pipeline with the retriever and prompt node as components, and run the pipeline to generate RAG responses.
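A sketch of the query pipeline under the same assumptions: the retriever fetches the most relevant chunks from Weaviate, and the prompt node passes them to Mistral 7B for answer generation.

```python
query_pipeline = Pipeline()
query_pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
query_pipeline.add_node(component=prompt_node, name="PromptNode", inputs=["Retriever"])

response = query_pipeline.run(
    query="What is model risk management?",
    params={"Retriever": {"top_k": 3}},  # number of document chunks passed to the LLM
)
print(response["results"][0])
```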
LLM Responses for Non-RAG vs. RAG Implementation
We can observe the difference in the Mistral model's responses to the same query when the RAG implementation is used. Without RAG, the query ‘What is model risk management?’ elicits a generic response about financial and stock market model risk. When the LLM is grounded with external data from the AI Risk Management Framework, it generates a response about AI risk management, covering specific aspects of AI risk. This example highlights the utility of RAG.
LLM Response Without RAG:
LLM Response With RAG:
Conclusion
In summary, building and deploying a RAG system using Mistral 7B, Haystack, and Weaviate is a multifaceted process that requires careful planning, implementation, and continuous evaluation. Each step, from indexing to deployment, plays a crucial role in creating a system that is not only functional but also efficient and reliable in a production environment.
References
[1] Retrieval Augmented Generation: https://research.ibm.com/blog/retrieval-augmented-generation-RAG
[2] Video Tutorial on RAG Implementation Using Haystack and Weaviate: https://www.youtube.com/watch?v=C5mqILmVUEo