Introduction
In today's data-driven world, enterprises are constantly seeking innovative ways to manage and leverage their vast repositories of content efficiently. Retrieval Augmented Generation (RAG) pipelines represent a cutting-edge approach to content retrieval and generation, combining the power of large language models with information retrieval techniques. Mamba is a state-of-the-art selective state space model (SSM), and enterprises can leverage it to construct robust RAG pipelines tailored to their specific needs, transforming how they interact with and derive insights from their content.
RAG pipelines integrate two critical components: retrieval and generation. The retrieval component sifts through the enterprise's content repositories to identify relevant information based on user queries or prompts. This retrieved information serves as context for the generation component, which employs advanced language models like Mamba to generate accurate and contextually relevant responses, summaries, or insights.
For example, the EdTech industry can harness the Mamba model for enterprise content generation to great effect. EdTech platforms can use a RAG pipeline to create personalized learning materials for students: by retrieving relevant educational resources from vast content repositories, such as textbooks, academic journals, and online courses, and generating explanations, summaries, or practice questions, educators can provide students with highly targeted and engaging learning experiences.
Leveraging Mamba for the RAG pipeline can support content localization efforts by retrieving educational resources in multiple languages and generating translations, adaptations, or culturally relevant examples for diverse student populations. This capability enables EdTech platforms to expand their reach and deliver high-quality educational content to learners worldwide.
Let’s see how we can leverage the Mamba model for EdTech enterprise content generation.
Why Leverage E2E Networks’ Cloud GPUs?
Mamba, particularly when used in complex tasks like Retrieval Augmented Generation (RAG), can be computationally intensive. Cloud GPUs offer high-performance computing capabilities that can handle the computational demands of running Mamba efficiently. They enable parallel processing of tasks, which is crucial for speeding up the execution of Mamba, especially when dealing with large datasets or complex models. This parallelization can lead to faster inference times and more responsive systems.
In a Retrieval Augmented Generation pipeline, where the workload may vary over time or across datasets, the ability to dynamically adjust GPU resources ensures optimal performance and cost-effectiveness. Mamba can also require significant memory, especially when processing large datasets or models, which Cloud GPU resources handle easily. While cloud GPU services involve operational costs, they can be more cost-effective than purchasing and maintaining dedicated hardware, particularly for variable workloads: users pay only for the resources they consume, which makes them a cost-efficient option for running Mamba-based RAG pipelines.
This is where E2E Networks comes into the picture. E2E Networks provides a variety of Cloud GPUs, which you can see in the product list; they are affordable and highly advanced. To get started, create your account on E2E Networks’ My Account portal and log in to your E2E account. Then set up your SSH keys by visiting Settings.
After creating the SSH keys, visit Compute to create a node instance.
Open Visual Studio Code and install the Remote Explorer and Remote - SSH extensions. Open a new terminal and log in to your node from your local system with the following command:
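For example, assuming a root user and the public IP shown on your node's detail page (the placeholder below must be replaced with your node's actual IP):

```bash
# Connect to the E2E node over SSH; <node-public-ip> is a placeholder
# for the IP address listed on the node's detail page in My Account
ssh root@<node-public-ip>
```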
With this, you’ll be logged in to your node.
RAG Pipeline Using Mamba
Install the dependencies needed to make a RAG pipeline using the Mamba model.
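A possible set of installs is shown below; exact package versions may differ, and mamba-ssm and causal-conv1d require a CUDA-enabled PyTorch build:

```bash
# Dataset loading, text splitting, embeddings, vector storage, and tokenization
pip install datasets pandas langchain langchain-community fastembed qdrant-client transformers

# The official state-spaces Mamba implementation and its optimized convolution kernel
pip install torch causal-conv1d mamba-ssm
```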
As we are going to build a RAG pipeline for the EdTech industry, let’s use the OpenStax subset of the Cosmopedia dataset. Using the ‘datasets’ library, download the dataset and store it in a CSV file.
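A minimal sketch, assuming the ‘openstax’ configuration name on the HuggingFaceTB/cosmopedia dataset card and an output file called openstax.csv:

```python
import pandas as pd
from datasets import load_dataset

# Download a slice of the OpenStax subset of Cosmopedia;
# the split slice keeps the experiment small and fast
dataset = load_dataset("HuggingFaceTB/cosmopedia", "openstax", split="train[:1000]")

# Persist the records to a CSV file so they can be reused without re-downloading
pd.DataFrame(dataset).to_csv("openstax.csv", index=False)
```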
Then, using a text splitter, we will split the documents into chunks.
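One way to do this is with LangChain's CSVLoader and RecursiveCharacterTextSplitter; the chunk size and overlap below are illustrative defaults:

```python
from langchain_community.document_loaders import CSVLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the CSV rows back as LangChain Document objects
documents = CSVLoader(file_path="openstax.csv").load()

# Split each document into overlapping chunks so that individual
# chunks stay small enough to embed and retrieve precisely
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = text_splitter.split_documents(documents)
```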
Now, we’ll create embeddings of the text chunks we got after splitting. We will use FastEmbed for creating the embeddings. To know more about its models, visit here.
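A short sketch using the FastEmbed integration shipped with langchain-community; the model name shown is FastEmbed's default and is only an assumption here:

```python
from langchain_community.embeddings import FastEmbedEmbeddings

# FastEmbed provides small, CPU-friendly embedding models;
# "BAAI/bge-small-en-v1.5" is its default model
embeddings = FastEmbedEmbeddings(model_name="BAAI/bge-small-en-v1.5")
```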
Once the embeddings are ready, we need to save them in a vector database so that retrieval will be easy. Here, we are using the Qdrant vector database, which can store the embeddings in in-memory storage. We named the collection ‘Edtech’.
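Building the in-memory collection from the split documents might look like this, reusing the docs and embeddings objects from the previous steps:

```python
from langchain_community.vectorstores import Qdrant

# Create an in-memory Qdrant collection named "Edtech" and index the chunks;
# location=":memory:" keeps everything in RAM, which is fine for experimentation
qdrant = Qdrant.from_documents(
    docs,
    embeddings,
    location=":memory:",
    collection_name="Edtech",
)
```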
Now, we need to prepare the model. But, before we prepare the model, let’s know what Mamba is.
Mamba
Mamba is a state-of-the-art sequence modeling architecture designed to handle a wide range of tasks that require understanding and generating sequential data. Mamba is built on the foundation of selective state space models (SSMs), which enable it to selectively remember and utilize relevant information while discarding unnecessary details. Mamba excels at autoregressive language modeling, demonstrating performance competitive with established Transformer architectures.
Model Components
- Linear Projections: Mamba utilizes linear projections to transform input sequences into a suitable representation for processing.
- Selective Mechanism: The selective mechanism in Mamba, facilitated by the SSM architecture, allows the model to focus on relevant information and discard irrelevant details efficiently.
- Selective State Spaces in Place of Attention: Rather than the attention mechanism used by Transformers, Mamba captures dependencies between different parts of the input sequence through its selective recurrent state, which is particularly useful for tasks requiring context understanding.
Let’s prepare the model. In this blog post, we’ll use the Mamba model released by the state-spaces organization.
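A sketch of loading a pretrained checkpoint with the mamba-ssm package; the 2.8B checkpoint is an assumption, and smaller variants such as state-spaces/mamba-130m also work if GPU memory is limited:

```python
import torch
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

# Load a pretrained Mamba checkpoint from the state-spaces organization
# onto the GPU in half precision
model = MambaLMHeadModel.from_pretrained(
    "state-spaces/mamba-2.8b", device="cuda", dtype=torch.float16
)
```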
Then, we will use the GPT-NeoX-20B model's tokenizer for tokenizing; we took this idea from the Mamba paper.
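Loading the tokenizer is a one-liner with Hugging Face transformers:

```python
from transformers import AutoTokenizer

# Mamba checkpoints were trained with the GPT-NeoX-20B tokenizer, so we reuse it here
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
tokenizer.pad_token = tokenizer.eos_token
```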
Then, we will define a function using asyncio. Inside it, we initialize the retriever and use it to fetch documents similar to the query. Since the retriever returns a list of Document objects, we extract the text from each one via its page_content attribute and use it as context alongside the query. We then encode this prompt to get the input_ids, generate the output with the model, and decode it with the tokenizer. We’ll pass a query to generate the content; let’s see the response:
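A sketch of such a function, reusing the qdrant, model, and tokenizer objects from the earlier steps; the prompt template, generation parameters, and the example query are illustrative choices rather than the only way to do it:

```python
import asyncio

async def generate(query: str, max_new_tokens: int = 500) -> str:
    # Initialize the retriever over the Qdrant collection and fetch similar chunks
    retriever = qdrant.as_retriever(search_kwargs={"k": 3})
    similar_docs = retriever.get_relevant_documents(query)

    # The retriever returns Document objects; pull out the raw text via page_content
    context = "\n".join(doc.page_content for doc in similar_docs)

    # Place the retrieved context before the question in a simple prompt
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

    # Generate with Mamba; the returned sequence includes the prompt tokens
    output = model.generate(
        input_ids=input_ids,
        max_length=input_ids.shape[1] + max_new_tokens,
        temperature=0.7,
        top_k=10,
        top_p=0.9,
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Example query against the EdTech collection
response = asyncio.run(generate("Explain Newton's second law of motion to a high-school student."))
print(response)
```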
The following will be the response:
Conclusion
As we saw, Mamba works slightly differently from the Transformer-based models we are used to. Mamba models improve upon Transformers in efficiency, scaling linearly rather than quadratically with sequence length. In this article, we demonstrated the steps to build a RAG pipeline using Mamba and a vector database. We expect to see more SSM-based architectures emerge in the near future.