Aya 101 is a state-of-the-art, open-source, massively multilingual large language model (LLM) developed by Cohere for AI. It has the remarkable capability of operating in 101 different languages, including over 50 that are considered underserved by most advanced AI models.
In this article, we will go through a step-by-step process of deploying and using the Aya model. We will also build a FAISS-powered RAG pipeline using Aya, and showcase how enterprises can use this approach to build AI applications.
The Aya 101 Model by Cohere for AI
The Aya 101 model is part of an open-science project by Cohere for AI, a collaborative effort involving contributions from people across the globe.
Aya's goal is to address the imbalance in language representation within AI by developing a model that understands and generates multiple languages, not just the ones that are predominantly represented online.
Key Facts about Aya
- Massively Multilingual: The model supports 101 languages, including over 50 that are rarely covered by other AI models.
- Open Source: The model, training process, and datasets are all open source.
- Groundbreaking Dataset: Aya is accompanied by the largest multilingual instruction dataset released to date, comprising 513 million data points across 114 languages.

Source: Cohere for AI
The need for such a project arises from a simple imbalance: while a significant portion of internet content is in English, approximately 7,000 languages are spoken worldwide. Most AI models do not support the majority of these languages, which limits access to technology for speakers of underrepresented languages. Aya seeks to change this by improving AI's multilingual capabilities and making it more inclusive.
Cohere for AI’s Aya initiative has drawn contributions from everyday citizens, educators, linguists, and anyone interested in language technology. By participating, these individuals helped democratize access to language technology and ensure broader language representation in the AI space.
For more detailed information, you can read about Cohere's Aya on their website.
Understanding the RAG Pipeline
The Retrieval-Augmented Generation (RAG) pipeline has become a powerful tool in the field of LLMs. At its core, the RAG pipeline combines two crucial steps:
- Retrieval step: Retrieving relevant stored information using vector search, a knowledge graph, or simple keyword search.
- Generation step: Generating coherent text using a combination of contextual knowledge and natural language generation capabilities of LLMs.
This combination allows the system to pull in essential details from a database and then use them to construct detailed and informative responses to user queries.
This ‘grounds’ the LLM in facts, and gives it the context or knowledge it needs to respond to user queries.
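Conceptually, the flow looks like this. The sketch below is illustrative only, with hypothetical retriever and llm objects; we build the real versions later in this article.
def rag_answer(query, retriever, llm):
    # Retrieval step: fetch the stored chunks most relevant to the query
    docs = retriever.get_relevant_documents(query)
    context = " ".join(doc.page_content for doc in docs)
    # Generation step: let the LLM answer using the retrieved context
    prompt = f"Answer the question using the context.\nContext: {context}\nQuestion: {query}"
    return llm(prompt)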
This is very powerful for enterprise applications for a variety of reasons. Imagine you're asking a complex question that requires specific knowledge. The RAG pipeline first searches through a large collection of documents to find the pieces of information most related to your question.
Then, using a language model, it takes that information and crafts a reply that feels both precise and human-like. The beauty of the RAG pipeline lies in its ability to provide answers that aren't just generic; they are customized and informed by the retrieved data, making the responses more accurate and trustworthy.
This makes RAG pipelines incredibly important for building intelligent chatbots, search engines, and help desks that can assist users with detailed and contextually relevant information.
FAISS as a Vector Store
FAISS, which stands for Facebook AI Similarity Search, is a library developed by Facebook AI that enables efficient similarity search. It provides algorithms to quickly search and cluster embedding vectors, making it suitable for tasks such as semantic search and similarity matching.
FAISS can handle large databases efficiently and is designed to work with high-dimensional vectors, allowing for fast and memory-efficient similarity search.
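As a quick, self-contained illustration of what FAISS does under the hood (separate from the LangChain pipeline we build below), here is a minimal nearest-neighbour search over random vectors:
import faiss
import numpy as np

# Build an exact L2 (Euclidean) index over 128-dimensional vectors
dim = 128
index = faiss.IndexFlatL2(dim)
vectors = np.random.random((1000, dim)).astype('float32')
index.add(vectors)  # the index now holds 1,000 vectors

# Find the 5 vectors closest to a query vector
query = np.random.random((1, dim)).astype('float32')
distances, indices = index.search(query, 5)
print(indices)  # ids of the 5 nearest neighbours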
In this article, we will use FAISS as our Vector Store, which will provide context to the Aya LLM. We will also use LangChain for building the pipeline.
Step-by-Step Guide to Building a RAG Pipeline with Aya
Choosing a GPU node
The code in this article was hosted on a V100 GPU node provided by E2E Networks. E2E Networks offers a variety of cloud GPU nodes designed to cater to different computational needs during AI model training and inference.
Our offerings also include powerful servers such as the HGX 8xH100 and HGX 4xH100, which integrate H100 GPUs with high-speed interconnects, making them ideal for demanding tasks like high-performance computing and machine learning.
The best part is that all our cloud GPUs come with optimized and integrated software stacks, including TensorFlow, GPU drivers, and CUDA, to support a wide range of applications and workloads efficiently.
To start with, sign up for an account here. After that, you can launch a V100 node from the ‘Compute’ tab on the sidebar.
To set up Aya, you need to first import the required modules.
from torch import cuda, bfloat16
import torch
import transformers
from transformers import AutoTokenizer
from langchain.llms import HuggingFacePipeline
Then set up the quantization config.
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,               # load weights in 4-bit precision
    bnb_4bit_quant_type='nf4',       # NormalFloat4 quantization
    bnb_4bit_use_double_quant=True,  # also quantize the quantization constants
    bnb_4bit_compute_dtype=bfloat16  # run computations in bfloat16
)
Load the model and the tokenizer.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "CohereForAI/aya-101"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
aya_model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint, quantization_config=bnb_config)
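Optionally, you can sanity-check how much memory the quantized model occupies; get_memory_footprint is a standard transformers utility on loaded models.
# Rough check of how much memory the 4-bit model occupies
print(f"Model memory footprint: {aya_model.get_memory_footprint() / 1e9:.1f} GB")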
Create a query pipeline.
query_pipeline = transformers.pipeline(
    "text2text-generation",
    model=aya_model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",
    max_length=512,          # cap the generated sequence length
    early_stopping=True,
    num_return_sequences=1,
    no_repeat_ngram_size=2,  # discourage repetitive output
)
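The retrieval chain we build later expects a LangChain-compatible LLM object, so wrap the pipeline with the HuggingFacePipeline class imported earlier:
# Expose the transformers pipeline as a LangChain LLM
llm = HuggingFacePipeline(pipeline=query_pipeline)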
Now let’s try to generate responses from Aya in different languages.
query_pipeline("Describe the state of Rajasthan in Hindi")
[{'generated_text': 'Rajasthan एक राज्य है जो भारत के उत्तर-पश्चिम में स्थित है। यह राजस्थान राज्य के दक्षिण में है और राजसमंद जिले के पूर्व में। राज्य में राजधानी अजमेर है, जो राजशाही शहर है जिसे राजा जयप्रकाश मौर्य ने स्थापित किया था। राजस् थान में लगभग तीन करोड़ लोग रहते हैं।'}]
Translation: ‘Rajasthan is a state located in the north-west of India. It is situated to the south of the Rajasthan state and to the east of the Rajasmand district. The capital of the state is Ajmer, which is a royal city established by King Jayaprakash Maurya. Approximately three crore people live in the state.’
query_pipeline("How to make Baklava? Give me the recipe in Turkish")
[{'generated_text': 'Baklava, tatlı olarak da bilinen çikolata soslu bir tatlıdır. İçine fıstık ezmesi, tarçın, şeker ve tarçını karıştırarak yapılır. Ayrıca yumurta, süt ve şekeri de karıştırabilirsiniz.'}]
Translation: ‘Baklava, also known as a dessert called sweet, is a dessert with chocolate sauce. It is made by mixing peanut butter, cinnamon, sugar, and cinnamon. You can also mix eggs, milk, and sugar.’
query_pipeline("How to make an igloo? Answer in Icelandic")
[{'generated_text': 'Hér er leiðbeiningar um hvernig á að búa til iglú: Fyrsta. Veldu efni: Veltu eitthvað sem þú vilt nota til að hýsa ígluna þína. Þetta gæti verið t.d. ís, snjó eða vatn. 2. Að hafa gott samband. Búðu til stað: Búið til stóran garð nálægt vatni ūar sem iglunin verđur. Setjið þar ýmsar plötur og greinar, svo sem tré, tré og tré. 3. Að búið er að setja bygginguna: Setja upp allar nauðsynlegar búnaður og setjað þá aftur efst án þess að þurfa að fjarlægja neitt. 4. Búa um ytri hluta íglúunnar: Þú getur búist við að nota ýmis tæki og verkfæri til þessa. Til dæmis, það er hægt að stilla hitastig gróðurs, hita olíu og hita lofts. Fimm. Fylgstu með henni: Fylgist vel með því að fylgjast með þörfum iðgunnar þinnar og fylgist með þeim breytingum sem þarf að gera. 6. Búđu til innri rýmið: Byrjađ ađ búđa inn nũjan herbergi hjá ūríđum. Notaðu mismunandi tegundir af dýrum, eins og dúk, dúkur og rúm. 7. Búða til útsýnið: Notađiđ útvarpstæki til ad horfa beint ur húsiđ. 8. Búðiđ til loftið og veggina: notađir rýmiđ sem er ekki mjög heitt og lofti. 9. Búðið til gluggana: Ýttu fingrum og tækjum gegn gluggunum og gluggum. Tíu. Bæta við lýsingu: Bætið við ljósi og ljósum til hússins til viðbótar við þær breytingar sem eru gerðar strax. 11 ára. Settu skjól: Settiđ svefnherbergiđ og svæđin hvort annađ undir skýliđ, til dæmi'}]
Translation: Here are instructions on how to build an igloo:
1. Choose material: Choose something to use to house your igloo. This could be ice, snow, or water, for example.
2. Have good communication.
3. Create a location: Create a large yard near water where the igloo will be. Place various plates and branches there, such as wood, trees, and trees.
4. Building the structure: Set up all necessary equipment and put them back on top without having to remove anything.
5. Build the outer part of the igloo: You can expect to use various tools and equipment for this. For example, it is possible to adjust the temperature of the greenhouse, heat oil, and heat air.
6. Follow it: Monitor the needs of your igloo and follow the changes that need to be made.
7. Create the inner space: Start by creating a cozy room with different types of bedding, such as canvas, cloth, and carpet.
8. Create the view: Use radio equipment to watch directly from the house.
9. Build the ceiling and walls: Use space that is not very hot and airy.
10. Make windows: Push fingers and tools against the windows and windows.
11. Add lighting: Add lights and lamps to the house in addition to the changes made immediately.
12. Set up shelter: Set up the bedroom and space either under the shelter, for example
query_pipeline("Write me a poem in Maltese")
[{'generated_text': "Ħej, ħbieb, I hawn biex ngħinuk b'kull mod li nista'. Jien l- ewwel u jien aħdar, Għalhekk jekk jogħġbok għidli x'tixtieq. I se jkun qed iħossu dwar dan, U huwa żmien tajjeb biżżejjed biża' bieb. Għandi t-tendenza li nibqa' lura, Jitlob grazzi, Allura ejja nikbru fuq dawn iż-żewġ affarijiet."}]
Translation: ‘Hi, friends, I'm here to help you in any way I can. I am the first and I am green, So please tell me what you want. He will be feeling this way, And it's a very good time with a closed door. I tend to stay back, Ask thanks, So let's grow on these two things.’
As we can see from the above responses, even though Aya can generate responses in multiple languages, the quality of those responses is still at a nascent stage.
Setting Up a RAG Pipeline with Gradio
Import the necessary modules.
from langchain.document_loaders import TextLoader, PyPDFLoader  # PyPDFLoader is used by the upload handler below
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import FAISS
Define a text splitter to break down the uploaded documents into smaller chunks.
# Split documents into 1,000-character chunks with a 20-character overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)
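As a quick sanity check (the sample string here is just an illustration), you can split some text directly:
# Split a sample string into chunks of at most 1,000 characters
sample_text = "E2E Networks provides cloud GPUs for AI workloads. " * 100
chunks = text_splitter.split_text(sample_text)
print(len(chunks), len(chunks[0]))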
Load an embedding model to vectorize the text in the document.
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}
embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)
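You can verify the embedding model loaded correctly by embedding a sample query; embed_query is part of the LangChain embeddings interface.
# Embed a sample sentence and inspect the vector size
vec = embeddings.embed_query("What does E2E Networks offer?")
print(len(vec))  # all-mpnet-base-v2 produces 768-dimensional vectors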
Define a function to create a question-answering chain from the uploaded documents.
import gradio as gr

def create_retrieval_chain(files):
    docs = []
    for file_path in files:
        if file_path.lower().endswith('.pdf'):  # Check if the file is a PDF
            loader_temp = PyPDFLoader(file_path)
            docs_temp = loader_temp.load_and_split(text_splitter=text_splitter)
            docs += docs_temp
        else:
            return "Please upload PDF files only"
    # Flatten newlines so chunks embed more cleanly
    for doc in docs:
        doc.page_content = doc.page_content.replace('\n', ' ')
    # Build the FAISS vector store and expose it as a retriever
    vectordb = FAISS.from_documents(documents=docs, embedding=embeddings)
    retriever = vectordb.as_retriever()
    global qa
    qa = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",  # stuff all retrieved chunks into one prompt
        retriever=retriever,
        verbose=True
    )
    return "Processed the PDF files. They can be queried now."
Define another function to answer the queries based on context retrieved from the documents.
def process_query(query):
    response = qa.invoke(query)
    return response
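Both functions can also be exercised directly, without the UI; the file path below is hypothetical.
# Index a local PDF and run a test query against it
status = create_retrieval_chain(["/path/to/sample.pdf"])  # hypothetical local PDF
print(status)
print(process_query("What is this document about?"))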
Now launch a Gradio interface. Make sure you set the host to 0.0.0.0 and open port 7865, so that the application can be accessed externally.
You can do so by running the following on your server’s Debian terminal.
sudo iptables -A INPUT -p tcp --dport 7865 -j ACCEPT
sudo iptables-save | sudo tee /etc/iptables/rules.v4
Then launch Gradio.
# Define the Gradio interface
iface_save_pdf = gr.Interface(
    fn=create_retrieval_chain,
    inputs=gr.Files(label="Upload Files", type='filepath'),
    outputs="text",
    title="PDF Uploader",
    description="Upload one or more PDF files. They will be indexed for querying."
)

iface_process_query = gr.Interface(
    fn=process_query,
    inputs=gr.Textbox(label="Enter your query"),
    outputs="text",
    title="Query Processor",
    description="Enter queries to get responses."
)

iface_combined = gr.TabbedInterface(
    [iface_save_pdf, iface_process_query],
    ["PDF Upload", "Query Processor"]
)

# Launch the combined interface
if __name__ == "__main__":
    iface_combined.launch(server_name='0.0.0.0', server_port=7865, share=True)
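Once it’s running, the app is reachable at http://<your-server-ip>:7865, or via the temporary public link that Gradio prints because share=True is set.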
The interface has two tabs: one for uploading the PDF documents and the other for querying them. I’m going to upload a document titled ‘Why are E2E Cloud Solutions Lower in Pricing Than Competitors?’. I downloaded a PDF version of this article from here.

Now let’s query the document using the other tab.

Conclusion
The Aya model presents a groundbreaking new capability in LLMs: handling multilingual queries. We believe that, in the future, LLMs like Aya will transform how we communicate and how enterprise applications build customer experiences.
If you want to learn more about how to deploy and use the Aya model, reach out to us at sales@e2enetworks.com.