Taken from YOLO World Paper
Have you ever felt stuck when an object detection model fails to identify an object because it wasn't trained on it? Or have you felt frustrated at having to train a new model from scratch to identify a new object? Not anymore! YOLO-World has come to save the day. It's an object detection model that can identify any object you want, provided you describe it in text. It's a model that can see beyond labels.
Why YOLO-World?
YOLO-World, developed by Tencent AI Lab - Computer Vision Center, is a novel object detection model that can identify objects using an offline vocabulary. It is a fusion of vision and language models that can identify objects based on a textual description. In a nutshell, it fuses the features extracted from the vision model with the embeddings extracted from the language model to understand the correlation between the image and its description. This fusion of text and image allows the recognition of objects that are not present in the training data and offers a better understanding of the context of the image.
YOLO-World was developed to address the limitations of fixed-vocabulary detectors and of detectors that use an online vocabulary during inference. YOLO-World uses an open vocabulary instead of a fixed vocabulary, but it doesn't stop there: it also uses an offline vocabulary instead of an online vocabulary during inference. You might now be wondering what all these fixed, open, online, and offline vocabularies mean. Let me explain.
Taken from YOLO World Paper
Fixed Vocabulary
Fixed-Vocabulary Detectors can only identify objects that are present in the training data, simply because they are trained on a fixed set of categories; objects outside the training data cannot be detected. These are the traditional detectors we use in everyday life, and their biggest drawback is that they can't find anything that isn't in the training data.
Open Vocabulary
Open-Vocabulary Detectors solve the problem we had with fixed-vocabulary detectors: categories not present in the training data can be identified. Generally, this is achieved via fusion with a language/prompt encoder. The detector encodes the prompt given by the user and uses these embeddings, along with the features extracted from the image, to identify the object.
Online Vocabulary
Online-Vocabulary Detectors apply the open-vocabulary idea we just saw at inference time: the user's prompt is encoded into a vocabulary for every input, and objects are detected against these vocabulary words. But this, again, has a drawback. Such models rely on heavy backbones to increase their open-vocabulary capacity, and re-encoding the prompt for every input makes them heavy and slow.
Offline Vocabulary
So, the next logical step is to somehow make it lightning fast while keeping the power of an open vocabulary. How is that done? With an offline vocabulary! You train your model in the online-vocabulary setting, but at inference time you switch to an offline vocabulary. Sounds simple, right? This is what YOLO-World does: it trains with an online vocabulary and infers with an offline one, which makes the model fast and suitable for real-world applications.
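To make the distinction concrete, here is a minimal, self-contained sketch (with a stand-in text encoder, not YOLO-World's actual code) of why an offline vocabulary is cheaper at inference time:

```python
import numpy as np

def encode_text(prompt: str) -> np.ndarray:
    """Stand-in for a heavy text encoder such as CLIP's."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal(512)

prompts = ["person", "dog", "car"]
frames = ["frame_001.jpg", "frame_002.jpg", "frame_003.jpg"]

# Online vocabulary: the prompts are re-encoded for every frame (repeated work).
for frame in frames:
    online_vocab = [encode_text(p) for p in prompts]

# Offline vocabulary: encode the prompts once, cache them, and reuse the cached
# embeddings for every frame, so the text encoder never runs in the detection loop.
offline_vocab = [encode_text(p) for p in prompts]
for frame in frames:
    pass  # the detector consumes the cached offline_vocab here
```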
Let's now get into the details of the model architecture.
Instead of relying solely on bounding boxes, YOLO-World uses something called region-text pairs. Imagine dividing an image into regions, each assigned a textual description that highlights its specific features. This provides a deeper understanding of the whole image and of where its content is placed.
YOLO-World essentially has three components. Inspired by YOLOv8, it has a DarkNet backbone, a Path Aggregation Network (PAN), and a head for bounding-box regression and object embeddings. Let's get into the details of each of these components.
DarkNet
The DarkNet feature extractor, first introduced in the YOLO9000 paper and later extended to the 53-layer Darknet-53 in YOLOv3, is a convolutional neural network that serves as the image encoder in YOLO-World. Pretrained on ImageNet for classification (with the full detectors trained on datasets such as COCO), it is used to extract visual features from the input image. The DarkNet architecture is designed to be relatively lightweight while maintaining high performance, and consists of convolutional layers with residual connections. Originally developed for image classification, it was later adapted to object detection.
Taken from YOLOv3 Paper
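As a rough illustration of the building block described above (a simplified sketch, not the exact Darknet-53 definition), a Darknet-style residual block in PyTorch looks something like this:

```python
import torch
import torch.nn as nn

class DarknetResidualBlock(nn.Module):
    """Simplified Darknet-style block: 1x1 bottleneck, 3x3 conv, skip connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(channels, channels // 2, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels // 2),
            nn.LeakyReLU(0.1),
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(channels // 2, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.conv2(self.conv1(x))  # residual connection

# A quick shape check on a dummy feature map.
features = DarknetResidualBlock(64)(torch.randn(1, 64, 80, 80))
```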
CLIP (Contrastive Language-Image Pretraining)
Then, we have a text encoder based on CLIP, which is used to extract embeddings from the textual description of the image. CLIP, which stands for Contrastive Language-Image Pretraining, is a language-vision model developed by OpenAI that learns to associate images with their textual descriptions, jointly understanding the correlation between them. Its main goal is to capture the semantic similarity between an image and its associated text. It is trained in a contrastive manner on roughly 400 million image-text pairs scraped from the web, learning from natural-language supervision rather than a fixed set of hand-annotated class labels. This is what makes it different from traditional vision models that rely on curated, manually labeled classification datasets.
Taken from CLIP paper
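If you want to see what such a text encoder produces, one quick way (using the Hugging Face transformers implementation of CLIP rather than the encoder bundled with YOLO-World) is:

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

# The small public CLIP checkpoint released by OpenAI.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["a person in red", "a brown animal", "a car"]
inputs = tokenizer(prompts, padding=True, return_tensors="pt")
with torch.no_grad():
    text_embeds = model.get_text_features(**inputs)                  # shape: (3, 512)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)   # unit length
```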
Path Aggregation Network (PAN)
The right information flow through a neural network is crucial for its success, and the Path Aggregation Network (PAN) does exactly that. It is a network built to make sure that the low-level and high-level features of an image are combined properly: it adds a bottom-up path on top of the usual top-down feature pyramid to mix features from different levels. This becomes very important for object detection, where object sizes range from a few pixels to the whole image. Imagine your model trying to detect objects in a crowded scene. PAN creates multiple pathways, each focusing on different aspects of the scene; some look at the lower-level details, while others look at the bigger picture. This is what makes PAN efficient at identifying objects. Inspired by this, YOLO-World reworks the PAN architecture to make it more efficient and suitable for open-vocabulary object detection.
Taken from PAN paper
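Here is a deliberately minimal sketch of the idea (the convolutions between merges are omitted, so this is nowhere near the full PAN implementation): a top-down pass spreads semantic information downwards, then the bottom-up pass that PAN adds propagates localization detail back up.

```python
import torch
import torch.nn.functional as F

# Multi-scale features from a backbone: high-resolution/low-level to low-resolution/high-level.
c3 = torch.randn(1, 256, 80, 80)
c4 = torch.randn(1, 256, 40, 40)
c5 = torch.randn(1, 256, 20, 20)

# Top-down pathway (FPN-style): upsample coarse features and merge downwards.
p5 = c5
p4 = c4 + F.interpolate(p5, scale_factor=2, mode="nearest")
p3 = c3 + F.interpolate(p4, scale_factor=2, mode="nearest")

# Bottom-up pathway (the part PAN adds): downsample fine features and merge upwards.
n3 = p3
n4 = p4 + F.max_pool2d(n3, kernel_size=2)
n5 = p5 + F.max_pool2d(n4, kernel_size=2)
```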
Let’s tie everything together!
The YOLO-World architecture combines all the above-mentioned ideas and takes them to a new level. It starts with two parallel networks: one responsible for extracting visual features from the image, and the other for extracting embeddings from the textual description of the image. The first network, responsible for extracting visual features, is DarkNet. The second, responsible for extracting embeddings from the textual description and converting them into vocabulary embeddings, is the CLIP text encoder. The multi-scale visual features extracted by DarkNet, along with the vocabulary embeddings from CLIP, are passed to the RepVL-PAN layer.
Taken from YOLO World Paper
Now, you must be wondering what this Re-parameterizable Vision-Language PAN (RepVL-PAN) layer is. This novel network introduced by YOLO-World, inspired by PAN, fuses the multi-scale image features and the vocabulary embeddings to understand the correlation between the image and its description (during training) or the user prompt or user-defined categories (during inference). It is composed of two main elements: the text-guided cross-stage partial layer (T-CSPLayer) and image-pooling attention (IPA). The T-CSPLayer injects the vocabulary embeddings into the visual features, while IPA enhances the text embeddings with image information to produce image-aware embeddings.
Taken from YOLO World Paper
Now, remember we talked about region-text pairs? This is where they come into play. The image-aware text embeddings from the RepVL-PAN and the object embeddings from the detection head are used to form region-text pairs. These pairs are then used to measure the similarity between the object in a region and its description, which is what the paper calls the contrastive head. Based on this similarity, along with non-max suppression, the model is able to identify the objects in the image.
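Conceptually, the contrastive head boils down to a normalized dot product between object embeddings and text embeddings. A toy version, with random tensors standing in for the real RepVL-PAN and head outputs, could look like this:

```python
import torch
import torch.nn.functional as F

num_regions, num_classes, dim = 100, 3, 512
object_embeds = F.normalize(torch.randn(num_regions, dim), dim=-1)  # stand-in for head outputs
text_embeds = F.normalize(torch.randn(num_classes, dim), dim=-1)    # stand-in for the vocabulary

# Similarity of every region to every prompt; the best-matching prompt becomes the label.
similarity = object_embeds @ text_embeds.T        # (num_regions, num_classes)
scores, labels = similarity.max(dim=-1)
keep = scores > 0.5                               # threshold before non-max suppression
```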
Pretty cool, right? It isn't over yet.
The neat trick is still to come. The model is trained with an online vocabulary, but during inference it switches to an offline vocabulary. This makes the model fast and suitable for real-world applications, and it is what makes YOLO-World unique and powerful. It's a model that can see beyond labels.
While training, YOLO-World uses the online-vocabulary setting, where the vocabulary is built on the fly from the nouns of the textual descriptions available in the dataset.
During inference, it uses something called prompt-then-detect with an offline vocabulary, making it more efficient. Here, the user defines custom prompts or categories they want to detect in the image. This user input is encoded once with the text encoder to obtain an offline vocabulary, which is then used to detect objects in the image. The offline vocabulary avoids re-encoding the prompts for every input while keeping the flexibility to adjust the vocabulary as needed. To learn more about the re-parameterization of the RepVL-PAN, please refer to the YOLO-World paper.
Model Performance
Now that we've understood the architecture of the YOLO-World model, let's see how its performance compares to other open-vocabulary models out there. The table below, taken from the paper, shows how YOLO-World outperforms them.
Taken from YOLO World Paper
Here are some of the results from the paper, showing the model's predictions for user-defined categories. The model can identify the objects in the image based on the categories the user defines.
Taken from YOLO World Paper
YOLO-World doesn't stop there; it goes further, showing how a user prompt describing the image can be used to identify a specific object in it. It truly can see beyond labels. Notice how it has identified ‘the person in red’ in the first image and ‘the brown animal’ in the second.
Taken from YOLO World Paper
It’s very impressive to see YOLO-World taking object detection to a new level. Imagine the power of a multimodal agent using YOLO-World as its vision model: you could ask your LLM questions about a particular object in an image, which would not be possible otherwise.
Let’s Code! Using YOLO-World Model to Identify Objects in Images
Enough of the theory; let's get our hands dirty and see how we can use the YOLO-World model to identify objects in images. We will be using the pre-trained model provided by Tencent AI Lab and the MIM library to install the dependencies, which will make our lives easier and let us focus on the fun part of the project. To follow this tutorial, you'll need a GPU, for which you can try a cloud platform like E2E.
E2E Networks
E2E Networks stands tall as the primary hyperscaler from India, supplying a compelling solution for AI and ML enthusiasts. E2E provides high-performance cloud GPU systems. Imagine tackling complex tasks like object detection with the raw power of NVIDIA A100/H100 GPUs – that's what E2E makes possible. Not only does E2E boast cutting-edge hardware, it also offers competitive pricing compared to global giants, making it an attractive option for cost-conscious developers. Beyond affordability, E2E is actively shaping the AI landscape in India, collaborating with research institutions and startups and fostering innovation, with customizable cloud solutions catering to diverse needs. If you're looking for a powerful and accessible platform to push the boundaries of AI in India, look no further than E2E Networks. Check out the website to access a GPU-powered system.
Install the Dependencies
Let's start with installing all the dependencies. We're first going to clone two excellent repos from Onuralp SEZER, named MMYOLO and YOLO-World. MMYOLO is an open-source toolbox for YOLO series algorithms based on PyTorch and MMDetection. It is a part of the OpenMMLab project. And YOLO-World contains the PyTorch implementation, pre-trained weights, and pre-training/fine-tuning code for YOLO-World.
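Assuming we are working in a Jupyter notebook on the GPU machine (the `!` prefix runs shell commands from a notebook cell), the clone step might look like this:

```python
# Clone the two repositories mentioned above.
!git clone https://github.com/onuralpszr/mmyolo.git
!git clone https://github.com/onuralpszr/YOLO-World.git
```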
MIM provides a unified interface for launching and installing OpenMMLab projects and their extensions, and for managing the OpenMMLab model zoo; it is part of the OpenMMLab project. We're going to use it to set up our YOLO-World model, so let's install the MIM package along with the OpenMMLab packages it manages.
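Here is a hedged install sequence using MIM; the exact version pins may need adjusting to match whatever the cloned repos expect at the time you run this.

```python
# Install MIM, then let it pull in the OpenMMLab packages YOLO-World builds on.
!pip install -U openmim
!mim install "mmengine>=0.7.0"
!mim install "mmcv>=2.0.0"
!mim install "mmdet>=3.0.0"
# Install the cloned repos in editable mode so their configs and modules are importable.
!pip install -e ./mmyolo
!pip install -e ./YOLO-World
```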
Now we need to restart the kernel before we can use any of these dependencies we just installed.
Download Model Weights and Image to Test on
Now we need to download the pre-trained weights for the YOLO-World model. We also need to download the image we want to test the model on. We're going to use the image of a person chasing a dog with several other objects in the background. Let's download the image and the weights.
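A sketch of the download step. The URLs below are placeholders: the actual checkpoint link depends on which YOLO-World variant you pick (the repo's README lists the official weights), and the test image can be any photo with a person, a dog, and some background objects.

```python
!mkdir -p weights
# Replace the placeholder URLs with the checkpoint link from the YOLO-World README
# and the image you want to test on.
!wget -O weights/yolo_world_weights.pth "<checkpoint-url-from-the-README>"
!wget -O demo_image.jpg "<test-image-url>"
```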
Now that we have everything, let’s start with the actual implementation.
We first start by defining a few functions that we will be using to get the prediction from our model.
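Below is a hedged sketch of those helpers, loosely adapted from the demo scripts in the YOLO-World repository. The exact config keys and pipeline behaviour can differ between releases, and `prepare_runner` / `run_inference` are names chosen here for illustration rather than functions shipped with the repo.

```python
import cv2
import torch
from mmengine.config import Config
from mmengine.dataset import Compose
from mmengine.runner import Runner


def prepare_runner(config_path: str, checkpoint_path: str) -> Runner:
    """Build an mmengine Runner for YOLO-World and load the pre-trained weights."""
    cfg = Config.fromfile(config_path)
    cfg.work_dir = "./work_dirs"
    cfg.load_from = checkpoint_path
    runner = Runner.from_cfg(cfg)
    runner.call_hook("before_run")
    runner.load_or_resume()
    # Reuse the test-time pipeline from the config for single-image inference.
    runner.pipeline = Compose(cfg.test_dataloader.dataset.pipeline)
    runner.model.eval()
    return runner


def run_inference(runner: Runner, image_path: str, class_names,
                  score_thr: float = 0.3, out_path: str = "result.jpg"):
    """Run prompt-based detection and save an annotated copy of the image."""
    # The vocabulary is a list of lists, with a trailing entry for the empty prompt.
    texts = [[name] for name in class_names] + [[" "]]
    data_info = runner.pipeline(dict(img_id=0, img_path=image_path, texts=texts))
    data_batch = dict(inputs=data_info["inputs"][None],
                      data_samples=[data_info["data_samples"]])
    with torch.no_grad():
        output = runner.model.test_step(data_batch)[0]
    pred = output.pred_instances
    pred = pred[pred.scores.float() > score_thr]  # keep confident detections only

    # Draw the surviving boxes with their class name and score.
    image = cv2.imread(image_path)
    for box, label, score in zip(pred.bboxes.cpu().numpy(),
                                 pred.labels.cpu().numpy(),
                                 pred.scores.cpu().numpy()):
        x1, y1, x2, y2 = box.astype(int)
        cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(image, f"{class_names[label]} {score:.2f}", (x1, max(y1 - 5, 0)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    cv2.imwrite(out_path, image)
    return pred
```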
Now we set the config and model weights paths used by the functions we just defined.
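The paths below are placeholders: point them at a pretraining config shipped inside the cloned YOLO-World repo and at the checkpoint downloaded earlier, then build the runner with the hypothetical helper defined above.

```python
config_path = "path/to/yolo_world_config.py"        # a config from YOLO-World/configs/
checkpoint_path = "weights/yolo_world_weights.pth"  # the checkpoint downloaded earlier

runner = prepare_runner(config_path, checkpoint_path)
```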
Let’s see the magic!
Even though there are many objects present in the image, let's start by checking whether the model can detect just the person, ignoring the rest.
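With the hypothetical `run_inference` helper from above, prompting for just the person might look like this:

```python
run_inference(runner, "demo_image.jpg", ["person"], out_path="person_only.jpg")
```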
Wonderful! The model is able to detect a person very efficiently based on our prompt. Now, let's see if it can detect other objects in the image, like a dog.
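Swapping the prompt is all it takes, for instance:

```python
run_inference(runner, "demo_image.jpg", ["dog"], out_path="dog_only.jpg")
```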
That's great. The model is able to detect dogs as well. Let's see if it can detect other objects in the image, like a car.
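Again, only the prompt changes:

```python
run_inference(runner, "demo_image.jpg", ["car"], out_path="car_only.jpg")
```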
That's nice! Now, let's throw it all in one go and see if it can detect all the objects in the image.
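Passing all the categories at once gives the combined result:

```python
run_inference(runner, "demo_image.jpg", ["person", "dog", "car"], out_path="all_objects.jpg")
```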
Wonderful! The model is able to detect all the objects in the image. It's a very powerful model that can see beyond labels, one that can identify any object you want, provided you describe it in text.
Codebase
You can find the code used in this blog at the following GitHub repo:
References
- https://arxiv.org/pdf/2401.17270.pdf
- https://github.com/onuralpszr/YOLO-World
- https://arxiv.org/pdf/1612.08242.pdf
- https://arxiv.org/pdf/2103.00020.pdf
- https://github.com/open-mmlab/mmengine
- https://github.com/onuralpszr/mmyolo.git
- https://arxiv.org/pdf/1803.01534.pdf
- https://github.com/AILab-CVC/YOLO-World
- https://huggingface.co/spaces/stevengrove/YOLO-World