Step-by-Step Guide to Building a Speech and Voice AI Assistant Using ASR with Llama 3.1

April 3, 2025


In today's digital age, AI-powered voice assistants have become essential for e-commerce businesses looking to improve customer interactions. They provide real-time, personalized responses, enabling companies to offer a more interactive and convenient shopping experience.

By adding open-source speech recognition technologies like Wav2Vec2, these assistants can accurately interpret spoken language, allowing customers to interact more naturally and smoothly. This integration also streamlines the buying process, making it easier for customers to find what they need through simple voice commands.

In this article, we will show you how to build an AI-powered voice assistant for e-commerce. We will focus on integrating Wav2Vec2 for speech input, setting up the necessary tools, and wiring everything to a simple web interface, so you can create your own responsive, voice-enabled virtual assistant that serves your customers' needs 24/7.

What Is Wav2Vec2

Wav2Vec2 is a state-of-the-art automatic speech recognition (ASR) model developed by Meta AI. Unlike traditional speech recognition systems that rely heavily on pre-defined features and manual labeling, Wav2Vec2 uses self-supervised learning to understand the structure of speech, making it more accurate and robust across different accents and languages.

The need for Wav2Vec2 arises from the increasing demand for voice-based applications where users prefer to interact using natural speech instead of typing. Using Wav2Vec2, our AI Assistant can transcribe audio input into text, helping the system understand and process spoken queries. This capability is essential for applications where users might be busy, have difficulty typing, or prefer the convenience of voice commands.

How to Use Wav2Vec2

We will first walk through the steps to use Wav2Vec2. After that, we will show how to build a Voice AI assistant for the e-commerce domain that uses a RAG (retrieval-augmented generation) architecture to provide contextual responses.

To get started, first sign up to E2E Cloud and launch a cloud GPU node. E2E Cloud offers the most price-performant cloud GPUs in the Indian market. You get blazing-fast performance, at a price point that’s far cheaper than AWS, GCP or Azure. Check the pricing page to learn more.

When launching the cloud GPU node, make sure you add your public SSH key (id_rsa.pub). This will allow you to SSH into the node remotely: 


$ ssh root@

Once you have logged in, create a user using the adduser command, and add the user to the sudoers list using visudo. 


$ adduser username
$ visudo

You can now create a Python virtual environment. 


$ python3 -m venv .env
$ source .env/bin/activate

Then install the dependencies:


$ pip install torch transformers librosa

You can now install Jupyter Lab and then use that to build this example: 


$ pip install jupyterlab
$ jupyter lab 

Testing Wav2Vec2

Let’s see how Wav2Vec2 works: 


import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import librosa

Set the device to GPU if available.


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Initialize the model and processor. 


model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h").to(device)
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h")

Load the audio file (here, a sample recording named output.wav) and prepare the input values. 


audio, rate = librosa.load("output.wav", sr=16000)
input_values = processor(audio, return_tensors="pt", padding="longest", sampling_rate=16000).input_values.to(device)

Generate text output.


with torch.no_grad():
    logits = model(input_values).logits

predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription[0])

What Is Parler TTS

Along with ASR, we will also use a TTS (text-to-speech) model. Parler TTS is an advanced, open-source text-to-speech (TTS) model developed to generate high-quality, natural-sounding speech with a high degree of control over various features such as gender, pitch, speaking style, and background noise. Leveraging the power of an auto-regressive transformer-based architecture, Parler TTS generates speech by creating audio tokens in a causal manner, allowing for real-time streaming of audio as it is produced. This reduces latency significantly, providing near-instantaneous audio output when using modern GPUs.

The model supports efficient attention mechanisms, like SDPA and Flash Attention 2, which optimize the generation speed by up to 1.4 times compared to traditional methods. Additionally, Parler TTS benefits from compilation techniques that can accelerate the model’s performance by up to 4.5 times. The flexibility of this model is enhanced by its ability to be fine-tuned using simple text prompts, enabling precise adjustments in speech attributes without requiring extensive retraining.

This model has been trained on extensive datasets, including over 10,500 hours of audio, making it capable of delivering high-fidelity speech synthesis suitable for diverse applications in AI-driven communication, virtual assistants, and content creation.

Let’s now use it to build a Voice AI assistant. 

Application Workflow

The diagram below explains our entire workflow. First, we use ASR to convert the user's spoken query into text. We then convert the query into embeddings, perform a similarity search against the stored customer profiles, and pass the retrieved context to an LLM to generate a response. The response is then converted back to speech using a TTS model. 

Using this workflow, you can build Voice AI assistants in several sectors. 
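To make the workflow concrete, here is a minimal sketch of the pipeline we will assemble. It assumes the helper functions (transcribe_audio, respond, text_to_speech_fun) that we define in the steps below:


# High-level sketch of the pipeline assembled in the following steps;
# the helper functions used here are defined later in this guide.
def voice_assistant_pipeline(audio_path: str) -> str:
    user_query = transcribe_audio(audio_path)   # Step 6: speech-to-text with Wav2Vec2
    bot_response = respond(user_query)          # Steps 2-5: embeddings, Qdrant search, LLM via Ollama
    text_to_speech_fun(bot_response)            # Step 7: Parler TTS writes the reply to parler_tts_out.wav
    return bot_response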

Prerequisites

Before running the code, make sure you have the following libraries installed:


! pip install sentence-transformers
! pip install qdrant_client langchain-community
! pip install gradio
! pip install pandas soundfile ollama
! pip install git+https://github.com/huggingface/parler-tts.git

You should also install Qdrant using a simple Docker command in the following way: 


$ docker pull qdrant/qdrant
$ docker run -p 6333:6333 -p 6334:6334 \
   -v $(pwd)/qdrant_storage:/qdrant/storage:z \
   qdrant/qdrant
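Optionally, you can verify that the Qdrant container is reachable from Python before moving on. This is a quick check, assuming Qdrant is listening on the default port 6333 on localhost:


from qdrant_client import QdrantClient

# Connect to the Dockerized Qdrant instance and list existing collections
check_client = QdrantClient(url="http://localhost:6333")
print(check_client.get_collections())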

Step 1: Loading and Processing Customer Data

We start by loading customer data from a CSV file and processing it to create separate customer profiles that will later be used to generate responses. The code for processing may vary according to the data provided.


import pandas as pd


file_path = 'data.csv'
df = pd.read_csv(file_path,encoding='ISO-8859-1')


def create_profile(customer_name, df):
    # Collect all transactions for this customer
    customer_data = df[df['CustomerName'] == customer_name]
    customer_info = customer_data.iloc[0]
    customer_id = customer_info['CustomerID']

    transactions = [
        f"On {row['InvoiceDate']}, {customer_name} bought {row['Quantity']} {row['Description']} for price: {row['UnitPrice']} (Stock Code: {row['StockCode']}, InvoiceNo: {row['InvoiceNo']}) in {row['Country']}"
        for _, row in customer_data.iterrows()
    ]
    history_text = "\n".join(transactions)
    profile = f"Name={customer_name}, Customer Id={customer_id}\n" + history_text

    return profile


def create_all_profiles(df):
    # Build one profile string per unique customer
    profiles = []
    unique_names = df['CustomerName'].unique()

    for name in unique_names:
        profiles.append(create_profile(name, df))

    return profiles


chunks = create_all_profiles(df)
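To verify that the profiles look as expected, you can print one of them; the exact output depends on your data.csv:


# Inspect the number of profiles and preview the first one
print(f"Total profiles: {len(chunks)}")
print(chunks[0][:500])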

Step 2: Encoding the Chunks Using a Pre-trained Embedding Model

You can use a pre-trained model like sentence-transformers/all-mpnet-base-v2 from the sentence-transformers library to turn the chunks into embeddings:


from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
vectors = model.encode(chunks)
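As a quick sanity check, each profile should now map to a 768-dimensional vector, which is the embedding size of all-mpnet-base-v2:


# vectors is a (num_profiles, 768) NumPy array for all-mpnet-base-v2
print(vectors.shape)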

Step 3: Storing the Embeddings in Qdrant

Now, you can store these embeddings in a database like Qdrant, which can also be used for semantic searches. The choice of the vector database is yours. 


from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams


# An in-memory Qdrant instance is used here for simplicity; to use the
# Dockerized instance started earlier, pass url="http://localhost:6333" instead.
client = QdrantClient(":memory:")


client.recreate_collection(
   collection_name="customer-profiles",
   vectors_config=VectorParams(size=len(vectors[0]), distance=Distance.COSINE),
)
client.upload_collection(
   collection_name="customer-profiles",
   ids=[i for i in range(len(chunks))],
   vectors=vectors,
)

Step 4: Implementing the Context Generation Function

We will now create a function that will fetch the context based on the query vector. It will use a similarity search to find document chunks closest to the query:


def make_context(question):
    # Embed the question and retrieve the most similar customer profiles
    ques_vector = model.encode(question)
    result = client.query_points(
        collection_name="customer-profiles",
        query=ques_vector,
    )
    sim_ids = [point.id for point in result.points]

    # Concatenate the top 5 matching profiles into a single context string
    context = ""
    for i in sim_ids[0:5]:
        context += chunks[i]
    return context

Step 5: Generating Responses Using LLM

We can now use Ollama to run open-source models such as Mistral NeMo (or Llama 3.1) to generate meaningful responses based on the retrieved context, which in this case consists of the relevant customer profiles.

For that, first install Ollama.


$ curl -fsSL https://ollama.com/install.sh | sh
$ ollama pull mistral-nemo
$ ollama run mistral-nemo

Now, you can use it in your code.


import ollama


def respond(question):
    stream = ollama.chat(
        model="mistral-nemo",
        messages=[{'role': 'user', 'content': f'This is the question asked by the user: {question}. The context given is: {make_context(question)}. Answer this question based on the context provided, in 1 to 2 lines.'}],
    )
    return stream['message']['content']
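You can test the function with a plain-text question before wiring up audio input. The customer name below is hypothetical and should be replaced with one present in your data.csv:


# Quick text-only test of the RAG pipeline (customer name is hypothetical)
print(respond("What did John Smith purchase most recently?"))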

Step 6: Adding Voice Recognition Using Wav2Vec2

To add voice interaction, we integrate the Wav2Vec2 model, which converts the user's speech input into text so that the voice AI assistant can process spoken queries effectively.


import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import librosa

import logging
logging.getLogger("transformers.modeling_utils").setLevel(logging.ERROR)

device = "cuda:0" if torch.cuda.is_available() else "cpu"
stt_model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h").to(device)
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h")

def transcribe_audio(audio_file):
   if audio_file is None:
       return "No audio file provided."
   audio, rate = librosa.load(audio_file, sr=16000)
   input_values = processor(audio, return_tensors="pt", padding="longest", sampling_rate=16000).input_values.to(device)
  
   with torch.no_grad():
       logits = stt_model(input_values).logits
  
   predicted_ids = torch.argmax(logits, dim=-1)
   transcription = processor.batch_decode(predicted_ids)
  
   return transcription[0].lower()
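You can test the function on a sample recording, for example the output.wav file used earlier (librosa resamples it to 16 kHz on load):


# Transcribe a sample recording; prints the lowercased transcript
print(transcribe_audio("output.wav"))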

Step 7: Implementing Text-to-Speech Functionality Using Parler TTS

To speak responses back to the user, we integrate the Parler text-to-speech (TTS) model, which converts the generated text response into audio, guided by a short description of the desired voice.


import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf
device = "cuda:0" if torch.cuda.is_available() else "cpu"

tts_model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

def text_to_speech_fun(prompt):
   description = "A male speaker, with the speaker's voice sounding clear and very close up."
   input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
   prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
   generation = tts_model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
   audio_arr = generation.cpu().numpy().squeeze()
   sf.write("parler_tts_out.wav", audio_arr, tts_model.config.sampling_rate)

Step 8: Combining All Functions

We now combine all the functions to receive an audio input, process it, and return an HTML response containing the user's query and the bot's reply, along with the generated audio output.


def generate_response_from_audio(audio):
    # Convert the user's speech to text
    user_query = transcribe_audio(audio)

    # Generate the LLM response and synthesize it to speech
    bot_response = respond(user_query)
    text_to_speech_fun(bot_response)

    audio_file = "parler_tts_out.wav"

    html_response = f"""
    <div>
        <p><b>Input:</b> {user_query}</p>
        <p><b>Response:</b> {bot_response}</p>
    </div>
    """
    return html_response, audio_file

Step 9: Integrating with the Gradio Interface

Finally, we can use Gradio to create a simple web interface for users to interact with the voice AI assistant. This interface will allow users to speak their queries and receive text as well as audio responses.


import gradio as gr


with gr.Blocks() as demo:
    gr.Markdown("# Voice AI Assistant")
    gr.Markdown("Hello! How may I help you?")

    audio_input = gr.Audio(type="filepath", label="Record your query")
    output = gr.HTML()
    audio_output = gr.Audio()

    audio_input.change(fn=generate_response_from_audio, inputs=audio_input, outputs=[output, audio_output])

demo.launch(share=True)

Output

Conclusion

By following this guide, you can create a powerful Voice AI Assistant for e-commerce that can understand customer queries via audio input, retrieve relevant information from customer profiles, and respond with text as well as audio output. This project combines tools like LangChain, Qdrant, Wav2Vec2, Parler TTS, Ollama, and Gradio to deliver a highly interactive and intelligent user experience.

Sign up to E2E Cloud today to start building a bi-directional voice AI chatbot for the e-commerce domain. You can also reach out to us at sales@e2enetworks.com to learn how to build data-sovereign AI on our MeitY-empanelled cloud platform, or for availing startup credits.
