Taken from YOLO World Paper
Have you ever felt stuck when an object detection model fails to identify an object because it wasn't trained on it? Or have you felt frustrated at having to train a new model from scratch to identify a new object? Not anymore! YOLO-World has come to save the day. It's an object detection model that can identify any object you want, provided you describe it in text. It's a model that can see beyond labels.
Why YOLO-World?
YOLO-World, developed by Tencent AI Lab - Computer Vision Center, is a novel object detection model that can identify objects using an offline vocabulary. It is a fusion of vision and language models that can identify objects based on a textual description. In a nutshell, it fuses the features extracted from the vision model with the embeddings extracted from the language model to understand the correlation between the image and its description. This fusion of text and image allows the recognition of objects that are not present in the training data and offers a better understanding of the context of the image.
YOLO-World was developed to address the limitations of fixed-vocabulary detectors and of detectors that use an online vocabulary during inference. YOLO-World uses an open vocabulary instead of a fixed vocabulary, but it doesn't stop there: it also uses an offline vocabulary instead of an online vocabulary during inference. You might now be wondering what all these fixed, open, online, and offline vocabularies mean. Let me explain.
Taken from YOLO World Paper
Fixed Vocabulary
Fixed-Vocabulary Detectors can only identify objects that are present in the training data, simply because they are trained on a fixed set of categories; objects outside the training data cannot be detected. These are the traditional detectors we use in everyday life, and their biggest drawback is that they can't find anything that isn't in the training data.
Open Vocabulary
Open-Vocabulary Detectors solve the problem we had with fixed-vocabulary detectors: categories not present in the training data can be identified. Generally, this is achieved via fusion with a language/prompt encoder. The detector encodes the prompt given by the user and uses these embeddings, along with the features extracted from the image, to identify the object.
Online Vocabulary
Online-Vocabulary Detectors apply the open-vocabulary idea we just saw at inference time: the user's prompt is encoded into a vocabulary for every input, and objects are detected against these vocabulary words. But this, again, has a drawback. Such models rely on heavy backbones to increase their open-vocabulary capacity, and re-encoding the prompt for every input makes them heavy and slow.
Offline Vocabulary
So, the next logical step is to somehow make it lightning fast while keeping the power of an open vocabulary. How is that done? With an offline vocabulary! You train your model in the online-vocabulary setting, but at inference time you switch to an offline vocabulary. Sounds simple, right? This is what YOLO-World does: it trains with an online vocabulary and infers with an offline one, which makes the model fast and suitable for real-world applications.
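To make the distinction concrete, here is a minimal, self-contained sketch (with a stand-in text encoder, not YOLO-World's actual code) of why an offline vocabulary is cheaper at inference time:

```python
import numpy as np

def encode_text(prompt: str) -> np.ndarray:
    """Stand-in for a heavy text encoder such as CLIP's."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal(512)

prompts = ["person", "dog", "car"]
frames = ["frame_001.jpg", "frame_002.jpg", "frame_003.jpg"]

# Online vocabulary: the prompts are re-encoded for every frame (repeated work).
for frame in frames:
    online_vocab = [encode_text(p) for p in prompts]

# Offline vocabulary: encode the prompts once, cache them, and reuse the cached
# embeddings for every frame, so the text encoder never runs in the detection loop.
offline_vocab = [encode_text(p) for p in prompts]
for frame in frames:
    pass  # the detector consumes the cached offline_vocab here
```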
Let's now get into the details of the model architecture.
Instead of relying solely on bounding boxes, YOLO-World uses something called region-text pairs. Imagine dividing an image into regions, each assigned a textual description that highlights its specific features. This provides a deeper understanding of the whole image and of where its content is placed.
YOLO-World essentially has three components. Inspired by YOLOv8, it has a DarkNet backbone, a Path Aggregation Network (PAN), and a head for bounding-box regression and object embeddings. Let's get into the details of each of these components.
DarkNet
The DarkNet feature extractor, first introduced in the YOLO9000 paper and later extended to the 53-layer Darknet-53 in YOLOv3, is a convolutional neural network that serves as the image encoder in YOLO-World. Pretrained on ImageNet for classification (with the full detectors trained on datasets such as COCO), it is used to extract visual features from the input image. The DarkNet architecture is designed to be relatively lightweight while maintaining high performance, and consists of convolutional layers with residual connections. Originally developed for image classification, it was later adapted to object detection.
Taken from YOLOv3 Paper
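As a rough illustration of the building block described above (a simplified sketch, not the exact Darknet-53 definition), a Darknet-style residual block in PyTorch looks something like this:

```python
import torch
import torch.nn as nn

class DarknetResidualBlock(nn.Module):
    """Simplified Darknet-style block: 1x1 bottleneck, 3x3 conv, skip connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(channels, channels // 2, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels // 2),
            nn.LeakyReLU(0.1),
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(channels // 2, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.conv2(self.conv1(x))  # residual connection

# A quick shape check on a dummy feature map.
features = DarknetResidualBlock(64)(torch.randn(1, 64, 80, 80))
```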
CLIP (Contrastive Language-Image Pretraining)
Then, we have a text encoder based on CLIP, which is used to extract embeddings from the textual description of the image. CLIP, which stands for Contrastive Language-Image Pretraining, is a language-vision model developed by OpenAI that learns to associate images with their textual descriptions, jointly understanding the correlation between them. Its main goal is to capture the semantic similarity between an image and its associated text. It is trained in a contrastive manner on roughly 400 million image-text pairs scraped from the web, learning from natural-language supervision rather than a fixed set of hand-annotated class labels. This is what makes it different from traditional vision models that rely on curated, manually labeled classification datasets.
Taken from CLIP paper
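If you want to see what such a text encoder produces, one quick way (using the Hugging Face transformers implementation of CLIP rather than the encoder bundled with YOLO-World) is:

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

# The small public CLIP checkpoint released by OpenAI.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["a person in red", "a brown animal", "a car"]
inputs = tokenizer(prompts, padding=True, return_tensors="pt")
with torch.no_grad():
    text_embeds = model.get_text_features(**inputs)                  # shape: (3, 512)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)   # unit length
```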
Path Aggregation Network (PAN)
The right information flow through a neural network is crucial for its success, and the Path Aggregation Network (PAN) does exactly that. It is a network built to make sure that the low-level and high-level features of an image are combined properly: it adds a bottom-up path on top of the usual top-down feature pyramid to mix features from different levels. This becomes very important for object detection, where object sizes range from a few pixels to the whole image. Imagine your model trying to detect objects in a crowded scene. PAN creates multiple pathways, each focusing on different aspects of the scene; some look at the lower-level details, while others look at the bigger picture. This is what makes PAN efficient at identifying objects. Inspired by this, YOLO-World reworks the PAN architecture to make it more efficient and suitable for open-vocabulary object detection.
Taken from PAN paper
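Here is a deliberately minimal sketch of the idea (the convolutions between merges are omitted, so this is nowhere near the full PAN implementation): a top-down pass spreads semantic information downwards, then the bottom-up pass that PAN adds propagates localization detail back up.

```python
import torch
import torch.nn.functional as F

# Multi-scale features from a backbone: high-resolution/low-level to low-resolution/high-level.
c3 = torch.randn(1, 256, 80, 80)
c4 = torch.randn(1, 256, 40, 40)
c5 = torch.randn(1, 256, 20, 20)

# Top-down pathway (FPN-style): upsample coarse features and merge downwards.
p5 = c5
p4 = c4 + F.interpolate(p5, scale_factor=2, mode="nearest")
p3 = c3 + F.interpolate(p4, scale_factor=2, mode="nearest")

# Bottom-up pathway (the part PAN adds): downsample fine features and merge upwards.
n3 = p3
n4 = p4 + F.max_pool2d(n3, kernel_size=2)
n5 = p5 + F.max_pool2d(n4, kernel_size=2)
```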
Let’s tie everything together!
The YOLO-World architecture combines all the above-mentioned ideas and takes them to a new level. It starts with two parallel networks: one responsible for extracting visual features from the image, and the other for extracting embeddings from the textual description of the image. The first network, responsible for extracting visual features, is DarkNet. The second, responsible for extracting embeddings from the textual description and converting them into vocabulary embeddings, is the CLIP text encoder. The multi-scale visual features extracted by DarkNet, along with the vocabulary embeddings from CLIP, are passed to the RepVL-PAN layer.
Taken from YOLO World Paper
Now, you must be wondering what this Re-parameterizable Vision-Language PAN (RepVL-PAN) layer is. This novel network introduced by YOLO-World, inspired by PAN, fuses the multi-scale image features and the vocabulary embeddings to understand the correlation between the image and its description (during training) or the user prompt or user-defined categories (during inference). It is composed of two main elements: the text-guided cross-stage partial layer (T-CSPLayer) and image-pooling attention (IPA). The T-CSPLayer injects the vocabulary embeddings into the visual features, while IPA enhances the text embeddings with image information to produce image-aware embeddings.
Taken from YOLO World Paper
Now, remember we talked about region-text pairs? This is where they come into play. The image-aware text embeddings from the RepVL-PAN and the object embeddings from the detection head are used to form region-text pairs. These pairs are then used to measure the similarity between the object in a region and its description, which is what the paper calls the contrastive head. Based on this similarity, along with non-max suppression, the model is able to identify the objects in the image.
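Conceptually, the contrastive head boils down to a normalized dot product between object embeddings and text embeddings. A toy version, with random tensors standing in for the real RepVL-PAN and head outputs, could look like this:

```python
import torch
import torch.nn.functional as F

num_regions, num_classes, dim = 100, 3, 512
object_embeds = F.normalize(torch.randn(num_regions, dim), dim=-1)  # stand-in for head outputs
text_embeds = F.normalize(torch.randn(num_classes, dim), dim=-1)    # stand-in for the vocabulary

# Similarity of every region to every prompt; the best-matching prompt becomes the label.
similarity = object_embeds @ text_embeds.T        # (num_regions, num_classes)
scores, labels = similarity.max(dim=-1)
keep = scores > 0.5                               # threshold before non-max suppression
```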
Pretty cool, right? It isn't over yet.
The neat trick is still to come. The model is trained with an online vocabulary, but during inference it switches to an offline vocabulary. This makes the model fast and suitable for real-world applications, and it is what makes YOLO-World unique and powerful. It's a model that can see beyond labels.
While training, YOLO-World uses the online-vocabulary setting, where the vocabulary is built on the fly from the nouns of the textual descriptions available in the dataset.
During inference, it uses something called prompt-then-detect with an offline vocabulary, making it more efficient. Here, the user defines custom prompts or categories they want to detect in the image. This user input is encoded once with the text encoder to obtain an offline vocabulary, which is then used to detect objects in the image. The offline vocabulary avoids re-encoding the prompts for every input while keeping the flexibility to adjust the vocabulary as needed. To learn more about the re-parameterization of the RepVL-PAN, please refer to the YOLO-World paper.
Model Performance
Now that we've understood the architecture of the YOLO-World model, let's see how its performance compares to other open-vocabulary models out there. The table below, taken from the paper, shows how YOLO-World outperforms them.
Taken from YOLO World Paper
Here are some of the results from the paper, showing the model's predictions for user-defined categories. The model can identify the objects in the image based on the categories the user defines.
Taken from YOLO World Paper
YOLO-World doesn't stop there; it goes further, showing how a user prompt describing the image can be used to identify a specific object in it. It truly can see beyond labels. Notice how it has identified ‘the person in red’ in the first image and ‘the brown animal’ in the second.
Taken from YOLO World Paper
It’s very impressive to see YOLO-World taking object detection to a new level. Imagine the power of a multimodal agent using YOLO-World as its vision model: you could ask your LLM questions about a particular object in an image, which would not be possible otherwise.
Let’s Code! Using YOLO-World Model to Identify Objects in Images
Enough of the theory; let's get our hands dirty and see how we can use the YOLO-World model to identify objects in images. We will be using the pre-trained model provided by Tencent AI Lab and the MIM library to install the dependencies, which will make our lives easier and let us focus on the fun part of the project. To follow this tutorial, you'll need a GPU, for which you can try a cloud platform like E2E.
E2E Networks
E2E Networks stands tall as the primary hyperscaler from India, supplying a compelling solution for AI and ML enthusiasts. E2E provides high-performance cloud GPU systems. Imagine tackling complex tasks like object detection with the raw power of NVIDIA A100/H100 GPUs – that's what E2E makes possible. Not only does E2E boast cutting-edge hardware, it also offers competitive pricing compared to global giants, making it an attractive option for cost-conscious developers. Beyond affordability, E2E is actively shaping the AI landscape in India, collaborating with research institutions and startups and fostering innovation, with customizable cloud solutions catering to diverse needs. If you're looking for a powerful and accessible platform to push the boundaries of AI in India, look no further than E2E Networks. Check out the website to access a GPU-powered system.
Install the Dependencies
Let's start with installing all the dependencies. We're first going to clone two excellent repos from Onuralp SEZER, named MMYOLO and YOLO-World. MMYOLO is an open-source toolbox for YOLO series algorithms based on PyTorch and MMDetection. It is a part of the OpenMMLab project. And YOLO-World contains the PyTorch implementation, pre-trained weights, and pre-training/fine-tuning code for YOLO-World.
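Assuming we are working in a Jupyter notebook on the GPU machine (the `!` prefix runs shell commands from a notebook cell), the clone step might look like this:

```python
# Clone the two repositories mentioned above.
!git clone https://github.com/onuralpszr/mmyolo.git
!git clone https://github.com/onuralpszr/YOLO-World.git
```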
MIM provides a unified interface for launching and installing OpenMMLab projects and their extensions, and for managing the OpenMMLab model zoo; it is part of the OpenMMLab project. We're going to use it to set up our YOLO-World model, so let's install the MIM package along with the OpenMMLab packages it manages.
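Here is a hedged install sequence using MIM; the exact version pins may need adjusting to match whatever the cloned repos expect at the time you run this.

```python
# Install MIM, then let it pull in the OpenMMLab packages YOLO-World builds on.
!pip install -U openmim
!mim install "mmengine>=0.7.0"
!mim install "mmcv>=2.0.0"
!mim install "mmdet>=3.0.0"
# Install the cloned repos in editable mode so their configs and modules are importable.
!pip install -e ./mmyolo
!pip install -e ./YOLO-World
```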
Now we need to restart the kernel before we can use any of these dependencies we just installed.
Download Model Weights and Image to Test on
Now we need to download the pre-trained weights for the YOLO-World model. We also need to download the image we want to test the model on. We're going to use the image of a person chasing a dog with several other objects in the background. Let's download the image and the weights.
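A sketch of the download step. The URLs below are placeholders: the actual checkpoint link depends on which YOLO-World variant you pick (the repo's README lists the official weights), and the test image can be any photo with a person, a dog, and some background objects.

```python
!mkdir -p weights
# Replace the placeholder URLs with the checkpoint link from the YOLO-World README
# and the image you want to test on.
!wget -O weights/yolo_world_weights.pth "<checkpoint-url-from-the-README>"
!wget -O demo_image.jpg "<test-image-url>"
```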
Now that we have everything, let’s start with the actual implementation.
We first start by defining a few functions that we will be using to get the prediction from our model.
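Below is a hedged sketch of those helpers, loosely adapted from the demo scripts in the YOLO-World repository. The exact config keys and pipeline behaviour can differ between releases, and `prepare_runner` / `run_inference` are names chosen here for illustration rather than functions shipped with the repo.

```python
import cv2
import torch
from mmengine.config import Config
from mmengine.dataset import Compose
from mmengine.runner import Runner


def prepare_runner(config_path: str, checkpoint_path: str) -> Runner:
    """Build an mmengine Runner for YOLO-World and load the pre-trained weights."""
    cfg = Config.fromfile(config_path)
    cfg.work_dir = "./work_dirs"
    cfg.load_from = checkpoint_path
    runner = Runner.from_cfg(cfg)
    runner.call_hook("before_run")
    runner.load_or_resume()
    # Reuse the test-time pipeline from the config for single-image inference.
    runner.pipeline = Compose(cfg.test_dataloader.dataset.pipeline)
    runner.model.eval()
    return runner


def run_inference(runner: Runner, image_path: str, class_names,
                  score_thr: float = 0.3, out_path: str = "result.jpg"):
    """Run prompt-based detection and save an annotated copy of the image."""
    # The vocabulary is a list of lists, with a trailing entry for the empty prompt.
    texts = [[name] for name in class_names] + [[" "]]
    data_info = runner.pipeline(dict(img_id=0, img_path=image_path, texts=texts))
    data_batch = dict(inputs=data_info["inputs"][None],
                      data_samples=[data_info["data_samples"]])
    with torch.no_grad():
        output = runner.model.test_step(data_batch)[0]
    pred = output.pred_instances
    pred = pred[pred.scores.float() > score_thr]  # keep confident detections only

    # Draw the surviving boxes with their class name and score.
    image = cv2.imread(image_path)
    for box, label, score in zip(pred.bboxes.cpu().numpy(),
                                 pred.labels.cpu().numpy(),
                                 pred.scores.cpu().numpy()):
        x1, y1, x2, y2 = box.astype(int)
        cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(image, f"{class_names[label]} {score:.2f}", (x1, max(y1 - 5, 0)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    cv2.imwrite(out_path, image)
    return pred
```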
Now we set the config and model weights paths used by the functions we just defined.
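The paths below are placeholders: point them at a pretraining config shipped inside the cloned YOLO-World repo and at the checkpoint downloaded earlier, then build the runner with the hypothetical helper defined above.

```python
config_path = "path/to/yolo_world_config.py"        # a config from YOLO-World/configs/
checkpoint_path = "weights/yolo_world_weights.pth"  # the checkpoint downloaded earlier

runner = prepare_runner(config_path, checkpoint_path)
```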
Let’s see the magic!
Even though there are many objects present in the image, let's start by checking whether the model can detect just the person, ignoring the rest.
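With the hypothetical `run_inference` helper from above, prompting for just the person might look like this:

```python
run_inference(runner, "demo_image.jpg", ["person"], out_path="person_only.jpg")
```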
Wonderful! The model is able to detect a person very efficiently based on our prompt. Now, let's see if it can detect other objects in the image, like a dog.
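Swapping the prompt is all it takes, for instance:

```python
run_inference(runner, "demo_image.jpg", ["dog"], out_path="dog_only.jpg")
```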
That's great. The model is able to detect dogs as well. Let's see if it can detect other objects in the image, like a car.
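Again, only the prompt changes:

```python
run_inference(runner, "demo_image.jpg", ["car"], out_path="car_only.jpg")
```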
That's nice! Now, let's throw it all in one go and see if it can detect all the objects in the image.
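Passing all the categories at once gives the combined result:

```python
run_inference(runner, "demo_image.jpg", ["person", "dog", "car"], out_path="all_objects.jpg")
```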
Wonderful! The model is able to detect all the objects in the image. It's a very powerful model that can see beyond labels, one that can identify any object you want, provided you describe it in text.
Codebase
You can find the code used in this blog at the following GitHub repo:
References
- https://arxiv.org/pdf/2401.17270.pdf
- https://github.com/onuralpszr/YOLO-World
- https://arxiv.org/pdf/1612.08242.pdf
- https://arxiv.org/pdf/2103.00020.pdf
- https://github.com/open-mmlab/mmengine
- https://github.com/onuralpszr/mmyolo.git
- https://arxiv.org/pdf/1803.01534.pdf
- https://github.com/AILab-CVC/YOLO-World
- https://huggingface.co/spaces/stevengrove/YOLO-World