Vector databases have been hitting the headlines recently. Pinecone raised a $100 million Series B at a $750 million valuation. Weaviate announced a $50M round of funding. There are other similar databases like Chroma, Qdrant, Milvus, Typesense – the list goes on. Fuelled by generative AI, these databases are fast-tracking the automation process and making the lives of AI engineers and data scientists easier.
What Are Vector Databases?
Vector databases are databases that specialize in the housing or development of vectors - where vectors are essentially a concept where it takes a piece of data and converts it into a string of numbers. Once the data is transformed into vectors, or vector embeddings that are stored in a database, we can perform searches and operations on them. What is unique about these databases is that they store the data only in the form of numbers which the computer understands. This drastically increases the robustness and decreases both computational and time complexity. It also helps to build relationships between various entities in the database. It ensures that data is organized in an intelligent manner.
Why Do We Need to Shift from Relational Databases?
Data is everywhere. Take, for instance, an employee database:
This database is an example of a traditional database. If you query for a specific department or any other parameter, you will get back the result easily. Traditional databases store data in rows and columns. They have different data types. They have both numerical and categorical data types mixed together.
Now, we have all kinds of unstructured data flooding the internet. It becomes hard to browse through all of them at once and extract the desired data of your choice. Even major search engines fail to generate the desired results when we query for a specific topic.
In comes vector databases. Before the advent of vector databases, data analysts and scientists had to go through tons of data-wrangling processes before actually working on the data. The generative nature of vector databases has solved this problem to a great extent. Hence, they save a lot of time and energy.
What Happens Under the Hood?
Vector databases use algorithms like Approximate Nearest Neighbours (ANN) to query through the docs. Above, we can see how same-colored data points are linked together with one another. This facilitates storage, retrieval, and other similarity search operations.
Text, images, or audio data that are stored in the form of embeddings are capable of retaining their meaning by saving similar data closer to each other. This ensures long-term memory.
Major Players in the Industry
The following table depicts the major players in the industry of vector databases:
The below plot depicts the interest in vector databases over time.
Build an Image Search Engine from Scratch on E2E Cloud
We will build an image search engine that takes an image as an input and will return a similar image as an output. We will be using the Weaviate vector database, which is open-source. Follow the steps below:
Step 1: Setting Up on the E2E Cloud
Log in to your account with your credentials and create a node:
Create a vGPU NVIDIA A100 node:
Go to your local terminal, and log in to your account via ssh:
Enter the password when prompted to. Then add your public ssh keys and disable password login.
Step 2: Installation
It is always a good practice to update the machine:
Moving on, install the stable version of Node.JS using the following commands:
Install npm:
Docker comes pre-installed in the machine. Now, we are done with the installation process.
Step 3: Setting Up Weaviate
Now to run Weaviate, note that it provides a docker-compose wizard from where we can select options.
Click Next:
Click Next:
Click Next:
Click Next:
Disable these options and set them to false.
Copy the docker-compose command:
It will generate the following output:
Step 4: Install the Weaviate Client
Step 5: Initialize the Client
Step 6: Create a Schema
Step 7: Store an Image
Step 8: Query an Image
Step 9: Create a Directory and Store Some Meme Images
Insert at least 5 meme images in this directory and name the directory, ‘img’. Also, add a test image to query and find a similar image.
Step 10: Run index.js
This will show the following output:
An image result.jpg will be created in the directory.
Our test image is:
Result.jpg is one of the images in our database. Hence, it returns a similar output.
Conclusion
In this article, we saw how to create an image search engine on our own. As time passes, we can add more images to the database and populate it. Then our search engine will become more efficient. With the power of generative AI, we have seen how E2E Cloud can be used to create our own search engine powered by Weaviate.
References
[1] https://unzip.dev/p/0x014-vector-databases/