In today's digital age, AI-powered voice assistants have become essential for e-commerce businesses looking to improve customer interactions. They provide real-time, personalized responses, enabling companies to offer a more interactive and engaging shopping experience.
By incorporating open-source speech recognition models such as Wav2Vec2, these assistants can accurately interpret spoken language, letting customers interact more naturally and smoothly. This integration also streamlines the buying process, making it easier for customers to find what they need through simple voice commands.
In this article, we will show you how to build an AI-powered voice assistant for e-commerce. We will focus on integrating Wav2Vec2 for speech recognition, setting up the necessary tools, and wiring them into a clean interface, demonstrating how you can create your own responsive, voice-enabled virtual assistant that can serve your customers 24/7.
What Is Wav2Vec2?
Wav2Vec2 is a state-of-the-art automatic speech recognition (ASR) model developed by Meta AI. Unlike traditional speech recognition systems that rely heavily on pre-defined features and manual labeling, Wav2Vec2 uses self-supervised learning to learn the structure of speech directly from raw audio, making it more accurate and robust across different accents and languages.
The need for Wav2Vec2 arises from the growing demand for voice-based applications, where users prefer to interact using natural speech instead of typing. Using Wav2Vec2, our AI assistant can transcribe audio input into text, helping the system understand and process spoken queries. This capability is essential for applications where users might be busy, have difficulty typing, or simply prefer the convenience of voice commands.
How to Use Wav2Vec2
We will first walk through the steps to use Wav2Vec2. After that, we will show how to build a voice AI assistant for the e-commerce domain that uses a RAG (retrieval-augmented generation) architecture to provide contextual responses.
To get started, first sign up to E2E Cloud and launch a cloud GPU node. E2E Cloud offers the most price-performant cloud GPUs in the Indian market: you get blazing-fast performance at a price point far lower than AWS, GCP, or Azure. Check the pricing page to learn more.
When launching the cloud GPU node, make sure you add your public SSH key (id_rsa.pub). This will allow you to SSH into the node in the following way:
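For instance, assuming your node's public IP is visible in the E2E Cloud dashboard (the address below is a placeholder):

```bash
# Replace the placeholder with your node's actual public IP address
ssh root@<node-public-ip>
```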
Once you have logged in, create a user with the adduser command, and add the user to the sudoers list using visudo.
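A minimal sketch of those two commands, with a hypothetical username:

```bash
# Create the user (you will be prompted for a password and details)
adduser devuser

# Open the sudoers file safely, then add a line like the one in the comment below
visudo
# devuser ALL=(ALL:ALL) ALL
```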
You can now create a Python virtual environment.
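A standard venv setup works:

```bash
python3 -m venv venv
source venv/bin/activate
```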
Then install the dependencies.
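The exact set depends on what you build; for the Wav2Vec2 test below, something like this is a reasonable baseline:

```bash
pip install torch torchaudio transformers librosa soundfile
```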
You can now install Jupyter Lab and then use it to build this example.
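Something along these lines works; you may prefer to tunnel port 8888 over SSH rather than binding to all interfaces:

```bash
pip install jupyterlab
# Bind to all interfaces so you can reach it from your local browser
jupyter lab --ip 0.0.0.0 --port 8888
```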
Testing Wav2Vec2
Let's see how Wav2Vec2 works. The process involves four steps, which the sketch below walks through:
1. Set the device to GPU.
2. Initialize the model and processor.
3. Load the audio file and prepare the input values.
4. Generate the text output.
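Here is a minimal sketch of those four steps using the transformers library. The facebook/wav2vec2-base-960h checkpoint and the audio file path are assumptions; Wav2Vec2 expects 16 kHz mono audio:

```python
import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Step 1: set the device to GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Step 2: initialize the model and processor
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").to(device)

# Step 3: load the audio file at 16 kHz and prepare the input values
speech, _ = librosa.load("sample_query.wav", sr=16000)
input_values = processor(
    speech, sampling_rate=16000, return_tensors="pt"
).input_values.to(device)

# Step 4: generate the text output by decoding the highest-scoring tokens
with torch.no_grad():
    logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
```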
What Is Parler TTS?
Along with ASR, we will also use a text-to-speech (TTS) model. Parler TTS is an advanced, open-source TTS model developed to generate high-quality, natural-sounding speech with a high degree of control over features such as gender, pitch, speaking style, and background noise. Built on an auto-regressive transformer architecture, Parler TTS generates audio tokens causally, which allows audio to be streamed in real time as it is produced. This significantly reduces latency, providing near-instantaneous audio output on modern GPUs.
The model supports efficient attention mechanisms, like SDPA and Flash Attention 2, which speed up generation by up to 1.4x compared to traditional attention. Additionally, Parler TTS benefits from compilation techniques that can accelerate inference by up to 4.5x. The model is also highly flexible: speech attributes can be adjusted precisely through simple text prompts, without any retraining.
This model has been trained on extensive datasets, including over 10,500 hours of audio, making it capable of delivering high-fidelity speech synthesis suitable for diverse applications in AI-driven communication, virtual assistants, and content creation.
Let’s now use it to build a Voice AI assistant.
Application Workflow
The diagram below explains the entire workflow. First, we use ASR to convert the user's spoken query into text. We then convert the query into an embedding, use it to perform a similarity search over the stored data, and generate an LLM response grounded in the retrieved context. Finally, the response is converted back to speech using a TTS model.
Using this workflow, you can build Voice AI assistants in several sectors.
Prerequisites
Before running the code, make sure you have the required libraries installed.
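Assuming the stack used in the rest of this article, a pip command along these lines should cover it (parler-tts is installed from its GitHub repository):

```bash
pip install torch torchaudio transformers sentence-transformers qdrant-client ollama gradio librosa soundfile pandas
pip install git+https://github.com/huggingface/parler-tts.git
```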
You should also install Qdrant, which you can run locally with a simple Docker command.
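This pulls the official image and exposes Qdrant's default REST port, 6333:

```bash
docker pull qdrant/qdrant
docker run -p 6333:6333 qdrant/qdrant
```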
Step 1: Loading and Processing Customer Data
We start by loading customer data from a CSV file and processing it into individual customer profiles that will later be used to generate responses. The processing code may vary depending on the shape of your data.
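A minimal sketch with pandas; the file name customers.csv and its columns (name, age, location, purchase_history) are hypothetical, so adapt the template to your own schema:

```python
import pandas as pd

# Load the raw customer data (hypothetical file and columns)
df = pd.read_csv("customers.csv")

# Build one free-text profile chunk per customer
profiles = [
    f"Customer {row['name']}, aged {row['age']}, based in {row['location']}. "
    f"Purchase history: {row['purchase_history']}."
    for _, row in df.iterrows()
]
```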
Step 2: Encoding the Chunks Using a Pre-trained Embedding Model
You can use a pre-trained model like sentence-transformers/all-mpnet-base-v2 to turn the chunks into embeddings with the sentence-transformers library.
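A sketch, assuming the profiles list from the previous step:

```python
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Each profile chunk becomes a 768-dimensional dense vector
embeddings = encoder.encode(profiles, show_progress_bar=True)
```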
Step 3: Storing the Embeddings in Qdrant
Now you can store these embeddings in a vector database like Qdrant, which also supports semantic search. The choice of vector database is yours.
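A sketch with qdrant-client, assuming the local Docker instance from the prerequisites and a hypothetical collection name, customer_profiles. Note that all-mpnet-base-v2 produces 768-dimensional vectors:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(url="http://localhost:6333")

# Create a collection sized for the embedding model's output
client.create_collection(
    collection_name="customer_profiles",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

# Store each embedding with its source text as payload
client.upsert(
    collection_name="customer_profiles",
    points=[
        PointStruct(id=i, vector=emb.tolist(), payload={"text": profiles[i]})
        for i, emb in enumerate(embeddings)
    ],
)
```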
Step 4: Implementing the Context Generation Function
We will now create a function that fetches context based on the query vector, using a similarity search to find the document chunks closest to the query.
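A sketch reusing the encoder and client from the earlier steps; get_context and top_k are illustrative names:

```python
def get_context(query: str, top_k: int = 3) -> str:
    # Embed the query with the same model used for the profiles
    query_vector = encoder.encode(query).tolist()

    # Retrieve the top_k most similar profile chunks
    hits = client.search(
        collection_name="customer_profiles",
        query_vector=query_vector,
        limit=top_k,
    )

    # Concatenate the retrieved chunks into a single context string
    return "\n".join(hit.payload["text"] for hit in hits)
```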
Step 5: Generating Responses Using LLM
We can now use Ollama to access open-source models like Mistral to generate meaningful responses based on the retrieved context, in this case a customer profile.
For that, first install Ollama.
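On Linux, the official install script works, after which you can pull the Mistral weights:

```bash
curl -fsSL https://ollama.com/install.sh | sh
ollama pull mistral
```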
Now, you can use it in your code.
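A sketch using the ollama Python package; the prompt template and the generate_response name are illustrative:

```python
import ollama

def generate_response(query: str, context: str) -> str:
    # Ground the model's answer in the retrieved customer profile
    prompt = (
        "You are a helpful e-commerce assistant. Use the customer profile "
        f"below to answer the query.\n\nProfile:\n{context}\n\nQuery: {query}"
    )
    result = ollama.chat(
        model="mistral",
        messages=[{"role": "user", "content": prompt}],
    )
    return result["message"]["content"]
```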
Step 6: Adding Voice Recognition Using Wav2Vec2
To add voice interaction, we integrate the Wav2Vec2 model, which converts the user's speech input into text that the voice AI assistant can then process as a query.
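A sketch that wraps the Wav2Vec2 model and processor initialized earlier into a reusable function; transcribe is an illustrative name:

```python
def transcribe(audio_path: str) -> str:
    # Resample to the 16 kHz mono audio Wav2Vec2 expects
    speech, _ = librosa.load(audio_path, sr=16000)
    input_values = processor(
        speech, sampling_rate=16000, return_tensors="pt"
    ).input_values.to(device)

    # Decode the most likely token at each timestep into text
    with torch.no_grad():
        logits = model(input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0].lower()
```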
Step 7: Implementing Text-to-Speech Functionality Using Parler TTS
To complete the voice loop, we integrate the Parler text-to-speech (TTS) model, which converts the generated text response into audio, guided by a short description of the desired voice.
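A sketch following the parler-tts usage pattern, assuming the parler-tts/parler-tts-mini-v1 checkpoint; the voice description and the text_to_speech name are illustrative:

```python
import soundfile as sf
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

tts_model = ParlerTTSForConditionalGeneration.from_pretrained(
    "parler-tts/parler-tts-mini-v1"
).to(device)
tts_tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

def text_to_speech(text: str, out_path: str = "response.wav") -> str:
    # The description steers voice attributes such as gender, pitch, and pace
    description = "A friendly female voice speaks clearly at a moderate pace."
    input_ids = tts_tokenizer(description, return_tensors="pt").input_ids.to(device)
    prompt_ids = tts_tokenizer(text, return_tensors="pt").input_ids.to(device)

    # Generate the waveform and write it to a WAV file
    generation = tts_model.generate(input_ids=input_ids, prompt_input_ids=prompt_ids)
    audio = generation.cpu().numpy().squeeze()
    sf.write(out_path, audio, tts_model.config.sampling_rate)
    return out_path
```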
Step 8: Combining All Functions
We now combine all the functions to receive and process an audio input, returning an HTML-formatted bot response along with the audio output.
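A sketch that chains the illustrative functions defined above (transcribe, get_context, generate_response, text_to_speech):

```python
def assistant_pipeline(audio_path: str):
    query = transcribe(audio_path)              # speech -> text
    context = get_context(query)                # retrieve matching profiles
    answer = generate_response(query, context)  # grounded LLM response
    audio_out = text_to_speech(answer)          # text -> speech

    # Format the conversation as HTML for display in the UI
    html = f"<p><b>You asked:</b> {query}</p><p><b>Assistant:</b> {answer}</p>"
    return html, audio_out
```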
Step 9: Integrating with the Gradio Interface
Finally, we can use Gradio to create a simple web interface for the voice AI assistant, allowing users to speak their queries and receive both text and audio responses.
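A sketch assuming Gradio 4.x and the assistant_pipeline function from the previous step:

```python
import gradio as gr

demo = gr.Interface(
    fn=assistant_pipeline,
    inputs=gr.Audio(sources=["microphone"], type="filepath", label="Speak your query"),
    outputs=[gr.HTML(label="Bot response"), gr.Audio(label="Audio response")],
    title="E-Commerce Voice AI Assistant",
)

# Launch the web UI; set share=True for a temporary public link
demo.launch()
```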
Output
Conclusion
By following this guide, you can create a powerful voice AI assistant for e-commerce that understands customer queries via audio input, retrieves relevant information from customer profiles, and responds with both text and audio output. This project combines powerful tools like LangChain, Qdrant, Wav2Vec2, Parler TTS, and Gradio to deliver a highly interactive and intelligent user experience.
Sign up to E2E Cloud today to start building a bi-directional voice AI chatbot for the e-commerce domain. You can also reach out to us at sales@e2enetworks.com to learn how to build data-sovereign AI on our MeitY-empanelled cloud platform, or to avail startup credits.