In the era of advanced artificial intelligence, Generative AI models have taken center stage, revolutionizing the way we interact with data. Models like DALL-E and Jukebox are capable of generating astonishingly realistic images and audio, thanks to their ability to learn from vast datasets and create human-like creative outputs.
While these AI models often steal the spotlight, there's a hidden hero working behind the scenes — the vector database. Modern vector databases, designed for efficiently storing and retrieving vector representations of data, play a pivotal role in the success of Generative AI models in real-world applications. In this article, we'll delve into the inner workings of vector databases and their crucial role in audio information retrieval.
How Do Vector Databases Work?
Before we explore the significance of vector databases, it's essential to understand how they differ from traditional databases. Traditional databases store data in tabular format, with rows and columns, while vector databases employ numeric vectors to represent and store data.
- Vector Representations: At the heart of vector databases lies the concept of representing data as numeric vectors. These vectors serve as digital signatures, encapsulating the essence of the data. For instance, an image of a cat could be encoded as a 512-dimensional vector, like [0.23, 0.54, 0.32, …, 0.12, 0.45, 0.90], while text data can be transformed into vectors based on the underlying semantics.
- Generating Vectors: Vectors can be generated in various ways, including through machine learning models like Word2Vec, BERT, and CLIP, data hashing techniques such as SimHash and MinHash, and data indexing methods that extract and combine features from text and images.
- Storing Vectors Efficiently: Once data is vectorized, vector databases offer various capabilities for efficient storage. These include compact storage, memory caching for faster retrieval, a distributed architecture that allows vectors to be distributed across nodes for scalability, and a columnar data layout for efficient analytical querying.
These techniques enable vector databases to store vast amounts of vector data effectively, making them a critical component of Generative AI.
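To make the idea concrete, here is a minimal sketch of turning a piece of text into a vector with the sentence-transformers library; the model name is just one common choice, not something prescribed by this article.

```python
# Minimal sketch: encode text into a dense vector with sentence-transformers.
# The model name below is an illustrative choice, not the only option.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
vector = model.encode("a photo of a cat")  # numpy array of shape (384,)
print(vector.shape)
```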
Vector Database Capabilities
The vector data model provides specialized database functionalities tailored for AI applications, including:
- Ultra-Fast Similarity Search: Vector databases excel at rapidly finding vectors similar to a query vector. This capability is vital for Generative AI, allowing applications like image search, recommendations, and anomaly detection.
- Approximate Nearest Neighbors: Algorithms like HNSW enable approximate nearest neighbor searches, offering significant speed improvements with minimal accuracy loss.
- Support for Sparse Vectors: Real-world vectors often exhibit sparsity, meaning they have relatively few non-zero dimensions. Vector databases employ specialized compression techniques to reduce storage requirements for sparse vectors while enabling fast distance calculations.
- Semantic Vector Search: Queries can match on semantic meaning rather than exact values. For instance, a query for 'dog' can surface conceptually related vectors such as 'cat', 'wolf', and 'pet'.
- Hybrid Vector + Metadata Search: Vector databases allow for powerful hybrid queries that combine vector similarity with traditional metadata filters, such as names, dates, and tags (illustrated in the sketch after this list).
- AI Model Integration: Vector databases can be tightly integrated with machine learning libraries like PyTorch and TensorFlow for model training and inference directly on vector datasets.
These unique capabilities of vector databases open the door to novel data discovery methods that fuel cutting-edge AI applications.
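As an illustration of the similarity and hybrid search capabilities above, here is a hedged sketch using the qdrant-client Python library; the collection name, payload field, and query vector are placeholders, not values from this article.

```python
from qdrant_client import QdrantClient, models
import numpy as np

client = QdrantClient(host="localhost", port=6333)
query_vector = np.random.rand(512).tolist()  # stand-in for a real embedding

# Hybrid query: nearest neighbours of the query vector, restricted to points
# whose payload field "genre" equals "latin".
hits = client.search(
    collection_name="my_collection",   # placeholder collection name
    query_vector=query_vector,
    query_filter=models.Filter(
        must=[models.FieldCondition(key="genre", match=models.MatchValue(value="latin"))]
    ),
    limit=5,
)
```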
Role of Vector Databases in AI Applications
Vector databases are the backbone of modern AI applications. They play pivotal roles in various aspects of AI, including:
- Training Data for Generative AI Models: Massive vector datasets, compiled from diverse sources, serve as training data for Generative AI models like DALL-E and Jukebox. These models derive their understanding of the world from analyzing these vector patterns.
- Few-Shot Learning: With a vector index in place, only a few example vectors are required for few-shot learning. This allows models to learn new concepts rapidly by observing vector proximity.
- In-Context Learning: In-context learning permits the incorporation of new training examples into model inputs at runtime, enabling dynamic adaptation.
- Recommender Systems: Recommender engines utilize vector databases to suggest relevant content by finding vectors similar to a user's interests based on their profile, behaviors, and queries.
- Semantic Information Retrieval: Vector databases enable the retrieval of documents or media by semantic similarity to input text or image vectors, shifting the focus from keyword matching to understanding user intent.
- Anomaly Detection: Vector databases aid in identifying anomalous data instances by detecting vectors that deviate from expected clusters. This capability is crucial for spotting potential fraud or system faults.
- Hybrid Recommendations: Hybrid recommendation systems combine collaborative filtering based on vector similarity with content-based filtering using metadata to provide highly relevant recommendations.
- Multimodal Search: Vector databases can jointly analyze vectors from different modalities, such as text, images, audio, and video, for unified multimodal search and analytics.
The Challenge of Audio Information Retrieval
Traditionally, searching for specific audio content has been a daunting task. Keyword-based searches can be unreliable, as they rely on manual tagging or transcription, which can be time-consuming and error-prone. Moreover, they often fail to capture the nuances and characteristics of audio that are essential for accurate retrieval.
This is where audio-driven search comes into play. By utilizing advanced machine learning techniques and vector databases, we can transform the way we search, access, and manage audio data.
Real-World Applications
Audio-driven search has numerous applications across various industries:
- Music Streaming: Services like Spotify use vector databases to offer personalized music recommendations and discover new tracks that match users' preferences.
- Voice Assistants: Vector databases help voice assistants like Siri and Google Assistant understand and respond to voice commands more accurately.
- Content Libraries: Media organizations can efficiently search and retrieve audio content for content creation, news reporting, and archives.
- Security and Surveillance: Vector databases are used for audio-based surveillance, helping identify specific sounds or spoken words in real-time.
Vector Databases: The Backbone of Audio-Driven Search
Vector databases are a key component of audio-driven search. These databases store and efficiently manage high-dimensional vectors that represent the features of audio content. Through machine learning models, audio data is transformed into vectors that encapsulate information about the content's characteristics, such as pitch, tempo, spectral features, and more. These vectors become the basis for fast and accurate searching.
Here's how vector databases work in audio-driven search:
- Feature Extraction: Audio content is processed to extract relevant features, such as MFCCs (Mel-frequency cepstral coefficients), spectrograms, or embeddings from deep learning models (a minimal sketch follows this list).
- Vectorization: The extracted features are transformed into high-dimensional vectors. These vectors capture the unique characteristics of the audio content and are ready for storage and retrieval.
- Vector Database Storage: The vectors are stored efficiently in a vector database. These databases are optimized for similarity searches, allowing users to compare and retrieve audio content based on vector similarity.
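A minimal sketch of the first two steps, using librosa and assuming a local file called song.mp3; the pooled MFCC vector is only illustrative, since production systems usually rely on learned embeddings instead.

```python
import librosa
import numpy as np

# 1. Feature extraction: load the audio and compute MFCCs.
waveform, sr = librosa.load("song.mp3", sr=22050)
mfccs = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=20)   # shape (20, n_frames)

# 2. Vectorization: pool the frame-level features into one fixed-size vector.
vector = np.concatenate([mfccs.mean(axis=1), mfccs.std(axis=1)])  # shape (40,)

# 3. This vector is now ready to be stored in a vector database.
print(vector.shape)
```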
Tutorial: Getting Started with Qdrant’s Self-Hosted Vector DB and Audio Data
This is a tutorial on vector databases and music recommendation systems using Python and Qdrant. In this tutorial, we'll explore how to work with audio data, embeddings, and vector databases to create your own music recommendation engine. We'll use the Ludwig Music Dataset (Moods and Subgenres) from Kaggle, which contains over 10,000 songs of different genres and subgenres.
If you require extra GPU resources for the tutorials ahead, you can explore the offerings on E2E CLOUD, which provides a diverse selection of GPUs, making E2E a suitable choice for more advanced LLM-based applications.
Prerequisites
Before we begin, make sure you have:
- Downloaded the Ludwig Music Dataset (Moods and Subgenres) from Kaggle. The dataset includes an mp3 directory and a labels.json file.
- Created a virtual environment (if not in Google Colab) for your project. You can use conda or mamba to create an environment and activate it, or use virtualenv.
- Installed the required packages using pip; the install command is included in the setup snippet after this list.
- Set up Qdrant by running it in a Docker container. If you don't have Docker installed on your machine, you can find installation instructions in the official Docker documentation. After Docker is installed, follow these steps:
- Pull the Qdrant Docker image.
- Start Qdrant; both commands are shown in the snippet after this list.
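A sketch of the setup commands; the package list is an assumption based on the libraries used later in this tutorial, so trim it to the embedding approach you choose.

```bash
# Install the Python dependencies (adjust to the approach you follow later).
pip install qdrant-client datasets pandas librosa openl3 panns-inference transformers torch

# Pull and start Qdrant; the volume mount keeps your data between restarts.
docker pull qdrant/qdrant
docker run -p 6333:6333 -v $(pwd)/qdrant_storage:/qdrant/storage qdrant/qdrant
```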
Verify that Qdrant is running and accessible by importing the required libraries and connecting to Qdrant via its Python client.
We will also go ahead and create the collection for this tutorial. The vectors will have 2048 dimensions, and we'll set the distance metric to cosine similarity.
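A minimal sketch of the connection and collection setup; the collection name music_vectors is our own choice, not dictated by Qdrant.

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(host="localhost", port=6333)

# 2048 matches the panns_inference embeddings generated later; use 512 for
# OpenL3 or 768 for the Wav2Vec2 approach instead.
client.recreate_collection(
    collection_name="music_vectors",
    vectors_config=models.VectorParams(size=2048, distance=models.Distance.COSINE),
)
```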
Overview
The dataset we are using is the Ludwig Music Dataset (Moods and Subgenres) from Kaggle, which was compiled for music information retrieval (MIR) from Discogs and AcousticBrainz. It contains over 10,000 songs of different genres and subgenres. The dataset is quite large (12 GB), so it's recommended to download your favorite genre from the mp3 directory, plus the labels.json file, to follow along with the tutorial.
Once you've downloaded the dataset, you should see the following directories and files:
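The exact names can vary depending on how you extract the archive, but the layout looks roughly like this:

```
ludwig-music-dataset/
├── labels.json
├── mp3/
│   ├── latin/          # one folder per genre; folder names here are illustrative
│   └── ...
└── spectrograms/
```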
The labels.json file contains metadata such as artist, subgenre, album, and more associated with each song.
The spectrogram directory contains spectrograms, which are visual representations of the frequencies present in an audio signal over time. Spectrograms are useful for visualizing audio data.
Data Preparation
We'll start by extracting the metadata and audio files from the dataset. The code snippet below loads the data, resamples the audio to a common sampling rate, and extracts the metadata.
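A sketch of this step, assuming the mp3 files sit under a local mp3/ directory and using the Hugging Face datasets library, whose Audio feature decodes and resamples each file for us.

```python
from pathlib import Path
from datasets import Dataset, Audio

mp3_paths = sorted(str(p) for p in Path("mp3").rglob("*.mp3"))  # assumed location

music = Dataset.from_dict({"audio": mp3_paths}).cast_column(
    "audio", Audio(sampling_rate=44_100)   # decode and resample to a common rate
)
print(music[0]["audio"].keys())  # dict_keys(['path', 'array', 'sampling_rate'])
```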
As you can see, we get back dictionary objects containing the decoded array for each song, the path to where each file lives on our machine, and its sampling rate. Let's play the song at index 115 and see what it sounds like.
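Assuming the data was loaded as above, we can listen to it directly in a notebook:

```python
from IPython.display import Audio as play_audio

sample = music[115]["audio"]
play_audio(sample["array"], rate=sample["sampling_rate"])
```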
We'll need to extract the name of each mp3 file, as this is the unique identifier we'll use to look up the corresponding metadata for each song. While we are at it, we will also add a numeric index to the dataset.
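One way to do that with the dataset created above:

```python
import os

music = music.map(
    lambda row, idx: {
        "index": idx,  # numeric index we'll later use as the point id in Qdrant
        "name": os.path.splitext(os.path.basename(row["audio"]["path"]))[0],  # file name without .mp3
    },
    with_indices=True,
)
```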
The metadata we will use for our payload lives in the labels.json file, so let's extract it.
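A small sketch, assuming labels.json sits in the working directory next to the mp3 folder:

```python
import json

with open("labels.json") as f:
    labels = json.load(f)

# Peek at a single entry to see which fields are available for the payload.
print(list(labels.items())[:1])
```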
As you can see, the dictionaries above contain a lot of useful information. Let's create a function to extract the data we want to retrieve for our recommendation system.
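A hedged sketch of such a function; the key names below are assumptions, so inspect one entry of labels.json and adjust them to the actual structure.

```python
def extract_payload(track_id: str) -> dict:
    """Pull the fields we care about for one track out of labels.json."""
    entry = labels.get(track_id, {})  # the key names below are assumptions
    return {
        "artist": entry.get("artist"),
        "name": entry.get("name"),
        "genre": entry.get("genre"),
        "subgenres": entry.get("subgenres", []),
    }
```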
The last piece of the puzzle is to clean the subgenres a bit, and to extract the path to each of the files since we will need them to load the recommendations in our app later on.
We'll combine all files with metadata into one dataframe and then format it as a list of JSON objects for our payload.
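A sketch of that final assembly step, reusing the mp3_paths list and extract_payload function from above; the subgenre clean-up is approximate, since the raw label format may differ from what's assumed here.

```python
import pandas as pd

metadata = pd.DataFrame(
    {"index": music["index"], "name": music["name"], "path": mp3_paths}
)
metadata = metadata.join(pd.DataFrame([extract_payload(n) for n in metadata["name"]]))

# Light clean-up of the subgenre labels.
metadata["subgenres"] = metadata["subgenres"].apply(
    lambda subs: [str(s).strip(" '[]").lower() for s in subs] if isinstance(subs, list) else []
)

payloads = metadata.to_dict(orient="records")   # one JSON-like dict per song
```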
Audio Embeddings
Audio embeddings are compact, low-dimensional vector representations of audio signals. They effectively capture essential acoustic attributes like pitch, timbre, and spatial characteristics of sound. These embeddings serve as meaningful, condensed descriptions of audio data, finding application in a wide range of downstream audio processing tasks, including but not limited to speech recognition, speaker recognition, music genre classification, and event detection. Typically, these embeddings are derived by employing deep neural networks, which take raw audio as input and produce a learned, lower-dimensional feature representation of that audio. Moreover, they can be employed as inputs for subsequent machine learning models.
To embark on creating audio embeddings for your songs, you have several options:
- Train a deep neural network from scratch on your specific dataset and extract the resulting embedding layer.
- Utilize pre-trained models and the Python Transformers library.
- Employ specialized libraries like openl3 and panns_inference.
Although other methods exist, we will focus on approaches 2 and 3 here: the Transformers architecture along with the openl3 and panns_inference libraries.
Important Note: While three approaches are presented, you only need to select one for this tutorial. In this context, we will proceed with the panns_inference method.
Let's walk through each approach in turn, starting with OpenL3.
OpenL3
OpenL3 is an open-source Python library for computing deep embeddings from audio and image data. Its purpose is to provide a user-friendly framework for extracting embeddings with pre-trained deep neural network models based on the L3-Net (Look, Listen and Learn) architecture, with variants trained on music and environmental audio and a choice of embedding sizes. These embeddings find application in a multitude of audio and image processing tasks, ranging from speech recognition to music genre classification and object detection. In essence, OpenL3 makes it easy to integrate deep embedding models into the workflows of researchers and developers.
Now, let's proceed by loading an audio file and extracting the embedding layer with OpenL3.
The model returns an embedding vector for each timestamp, along with a vector of the timestamps themselves. This means that to get a single one-dimensional embedding for the whole song, we'll need to take the mean of these frame-level vectors.
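A minimal sketch with the openl3 library, reusing the dataset loaded earlier; the 512-dimensional music model is one of several variants openl3 offers.

```python
import openl3

sample = music[115]["audio"]
emb, ts = openl3.get_audio_embedding(
    sample["array"], sample["sampling_rate"],
    content_type="music", embedding_size=512,
)
print(emb.shape, ts.shape)          # (n_frames, 512) and (n_frames,)

song_embedding = emb.mean(axis=0)   # a single 512-dim vector for the whole song
```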
You can generate the embedding layer for the whole dataset with the following function. Note that if you load the model up front, it (and the Kapre layers it relies on) will run on a GPU without any further configuration.
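A sketch of such a function; the model is loaded once up front so the GPU is reused for every call.

```python
import numpy as np
import openl3

# Load the model once; this is the step that benefits from a GPU.
openl3_model = openl3.models.load_audio_embedding_model(
    input_repr="mel128", content_type="music", embedding_size=512
)

def openl3_embedding(row):
    emb, _ = openl3.get_audio_embedding(
        row["audio"]["array"], row["audio"]["sampling_rate"], model=openl3_model
    )
    return {"openl3_embedding": np.mean(emb, axis=0)}

# music = music.map(openl3_embedding)   # slow: OpenL3 is the heaviest option here
```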
The good thing about OpenL3 is that it comes with the best model for our task. The downside is that it is the slowest of the three methods showcased here.
PANNs Inference
PANNs Inference is a Python library, built on the foundation of PyTorch and torchaudio, designed to facilitate audio tagging and sound event detection tasks. It leverages convolutional neural network (CNN)-based models that have been trained on extensive audio datasets like AudioSet and UrbanSound8K. The primary goal behind this library is to simplify the utilization of these pre-trained models for researchers and practitioners, enabling them to perform inference on their own audio datasets without the need to embark on the arduous process of training models from the ground up. PANNs Inference offers a user-friendly, high-level API, streamlining the process of loading pre-trained models, generating embeddings, and conducting audio classification tasks with just a few lines of code.
To work with the PANNs Inference package, your data should be in either a numpy array or a torch tensor format, both conforming to the shape [batch, vector]. Therefore, let's adjust the format of our audio data accordingly.
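For a single song from our dataset, that reshaping looks like this:

```python
import numpy as np

sample = music[115]["audio"]
audio_batch = np.expand_dims(sample["array"], axis=0)   # shape [1, n_samples]
# torch alternative: torch.tensor(sample["array"]).unsqueeze(0)
print(audio_batch.shape)
```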
Bear in mind that this next step, downloading the model, can take quite a bit of time depending on your internet speed. Afterwards, inference is quite fast, and the model returns two arrays: the clip-wise tag predictions and the embeddings.
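A minimal sketch using the panns_inference AudioTagging model, which produces the 2048-dimensional embeddings our collection expects.

```python
from panns_inference import AudioTagging

# The first call downloads the pretrained checkpoint, which can take a while.
at = AudioTagging(checkpoint_path=None, device="cuda")   # use "cpu" if you have no GPU

clipwise_output, embedding = at.inference(audio_batch)
print(embedding.shape)   # (1, 2048)
```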
To get an embedding layer for all of the songs using the panns_inference package, you can use the following function. This is the output we will be using for the remainder of the tutorial.
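A sketch of that function, reusing the `at` model object created above:

```python
import numpy as np

def panns_embedding(row):
    audio = np.expand_dims(row["audio"]["array"], axis=0)
    _, embedding = at.inference(audio)        # embedding has shape (1, 2048)
    return {"embedding": embedding[0]}

music = music.map(panns_embedding)            # adds a 2048-dim vector to every song
```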
Transformers
Transformers represent a class of neural networks primarily employed in the realm of natural language processing. However, this versatile architecture can also be harnessed for the purpose of audio data processing. In this context, it dissects audio signals into smaller segments, learning how these fragments interconnect to convey significance.
One approach to leverage Transformers for audio data is to load a pre-trained model from the Hugging Face hub and extract embeddings from it. It is worth noting that this approach tends to yield the least favorable results out of the three methods. This is because Wav2Vec was originally trained to discern speech rather than classify music genres. Consequently, it's important to acknowledge that fine-tuning Wav2Vec for the specific data might not significantly enhance the quality of the embeddings.
A key step before extracting the features from each song and passing them through the model is to resample the songs to 16 kHz.
To generate the embedding layer for the whole dataset, we can use the following function.
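A sketch of both steps, using the base Wav2Vec2 checkpoint from the Hugging Face hub (the model name is a choice, not a requirement) and mean-pooling the hidden states over time to get one 768-dimensional vector per song.

```python
import torch
from datasets import Audio
from transformers import AutoFeatureExtractor, AutoModel

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = AutoModel.from_pretrained("facebook/wav2vec2-base")

# Resample the songs to the 16 kHz rate Wav2Vec2 was trained on.
music_16k = music.cast_column("audio", Audio(sampling_rate=16_000))

def wav2vec_embedding(row):
    inputs = feature_extractor(
        row["audio"]["array"], sampling_rate=16_000, return_tensors="pt"
    )
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state      # shape [1, frames, 768]
    return {"wav2vec_embedding": hidden.mean(dim=1).squeeze().numpy()}

# music_16k = music_16k.map(wav2vec_embedding)
```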
Creating a Recommendation System
Recommendation systems are a category of algorithms and methodologies designed to propose items or content to users based on their individual preferences, historical data, or behavioral patterns. The primary objective of these systems is to offer personalized suggestions to users, facilitating the discovery of new items of interest and enhancing their overall user experience. Recommendation systems find extensive applications across diverse domains, including e-commerce, streaming platforms, social media, and many others.
To get started, we will populate the collection we previously established. If you have chosen the Transformers approach or OpenL3 for this journey, you will need to recreate your collection with the appropriate dimension size.
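A sketch of the upload step, combining the panns_inference embeddings with the payloads built earlier (the collection name matches the one created at the start).

```python
from qdrant_client import models

client.upsert(
    collection_name="music_vectors",
    points=models.Batch(
        ids=metadata["index"].tolist(),
        vectors=[list(vec) for vec in music["embedding"]],
        payloads=payloads,
    ),
)
```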
We can retrieve any song by its id using client.retrieve() and then extract the information in the payload with the .payload attribute.
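For example (the id is just the song we listened to earlier):

```python
record = client.retrieve(collection_name="music_vectors", ids=[115])
print(record[0].payload)   # artist, name, genre, subgenres, path, ...
```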
You can search for similar songs with the client.search() method. Let's find an artist and a song we like and use that id to grab the embedding and search for similar songs.
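A sketch of that search, using a song's own embedding as the query vector (the id is a placeholder for whichever song you picked).

```python
song_id = 115                                   # placeholder: a song you like
query_vector = music[song_id]["embedding"]

hits = client.search(
    collection_name="music_vectors",
    query_vector=query_vector,
    limit=5,
)
for hit in hits:
    print(round(hit.score, 3), hit.payload["artist"], hit.payload["name"])
```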
You can evaluate the search results by looking at the score or by listening to the songs and judging how similar they really are. I, the author, can vouch for the quality of the ones we got for Celia Cruz.
The recommendation API works a bit differently: instead of a query vector, we pass the ids of positive (required) and, optionally, negative example vectors, and Qdrant does the heavy lifting for us.
Say we don't like Chayanne. We can use the id of one of his mushiest songs so that Qdrant gets us results as far away as possible from such a song.
Say we want to get recommendations based on a song we just recently listened to and liked, and the system remembers all of our preferences.
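Both scenarios map onto client.recommend(); the ids below are placeholders for the songs mentioned above.

```python
# Steer recommendations away from a song we don't want more of.
client.recommend(
    collection_name="music_vectors",
    positive=[115],       # a song we liked
    negative=[401],       # placeholder id for the song we want to avoid
    limit=5,
)

# Recommendations based only on songs we recently listened to and liked.
client.recommend(
    collection_name="music_vectors",
    positive=[115, 222, 333],   # placeholder ids of liked songs
    limit=5,
)
```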
And with that, you have built an audio recommendation system. You can even give it a simple front end and host it with Streamlit.
The Benefits of Audio-Driven Search
Audio-driven search offers several advantages over traditional methods:
- Efficiency: Searching for audio content becomes much faster, as the similarity between audio clips is calculated directly from their vectors.
- Accuracy: Audio-driven search can retrieve content based on acoustic similarity, making it more accurate and robust to variations like background noise, accents, or variations in pronunciation.
- Scalability: Vector databases are designed to handle large datasets, making them suitable for organizations with extensive audio libraries.
- Content Discovery: Users can discover similar audio content even if they don't know the exact keywords or tags associated with it.
- Cross-Modal Search: Some vector databases also support cross-modal search, allowing users to find relevant audio content based on visual queries, and vice versa.
Conclusion
In conclusion, vector databases are the unsung heroes of the AI revolution, enabling the most cutting-edge AI applications we encounter today. These databases empower Generative AI models, making them more accessible and efficient in real-world scenarios. As AI continues to evolve, vector databases will play an increasingly vital role in shaping the future of information retrieval and data-driven decision-making.