In deep learning, most of our attention typically goes to data preprocessing and model training. In a production machine learning workflow, however, model deployment and inference are just as important, and the business impact of AI depends heavily on the inference infrastructure. Consider a shopping website whose recommendation engine receives thousands of requests per second. The company delivers the customer experience it wants only when its ML systems are robust enough to accept and answer every request quickly and efficiently, which is why such companies maintain state-of-the-art infrastructure for deploying and running their AI systems. The same applies to any business looking to accelerate with AI.
Inference is the step that turns a trained model into actionable insights from raw data. Depending on the application, it can be real-time or offline. Recommendation and fraud detection systems need immediate responses, so they use real-time inference. Systems like predictive maintenance, on the other hand, do not need immediate output; they need to process large amounts of data and maximize throughput, which is where offline (batch) inference fits. Inference environments also vary from model to model and depend on the framework used and the deployment device, so no single, generalized inference platform can be assumed for all models, frameworks, and devices.
NVIDIA Triton Inference Server
The problems discussed above are largely solved by a scalable AI inference platform that handles all frameworks and model types. NVIDIA Triton Inference Server is one such solution: a fast, scalable, open-source inference server that runs in virtually any CPU or GPU environment. It supports a wide range of frameworks, including TensorRT, PyTorch, TensorFlow, ONNX, and Python backends, and it integrates with Kubernetes and MLOps platforms. It maximizes hardware utilization through ensemble models and concurrent model execution.
Additionally, it supports dynamic batching, which lets the server group incoming client requests into larger batches on the fly, improving throughput while keeping latency within configured limits. Triton thus makes standardized inference possible in the cloud, on edge devices, or on any other platform, with any framework.
In this tutorial, we will explore NVIDIA Triton Inference Server on the E2E Cloud platform. E2E Cloud has native integration with Triton for creating and deploying models. We will deploy a YOLOv5 model on the Triton Inference Server and run inference against it.
Prerequisites
For this tutorial, you need an account on E2E Cloud with access to the TIR AI platform, and Git installed on your system. We also need a YOLOv5 model to deploy; you can find it in the official repository. Clone the repository.
NVIDIA Triton supports formats such as ONNX, so we must first convert the model to one of these formats. Let's export a pre-trained YOLOv5 model to ONNX.
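A typical way to do this is to clone the official Ultralytics YOLOv5 repository and install its dependencies (a sketch; adjust paths to your environment):

```bash
# Clone the official YOLOv5 repository and install its requirements
git clone https://github.com/ultralytics/yolov5.git
cd yolov5
pip install -r requirements.txt
```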
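One way to do this is with the export script bundled in the YOLOv5 repository (a sketch, assuming the small yolov5s.pt weights and a 640x640 input; change these to match your model):

```bash
# Export pre-trained YOLOv5s weights to ONNX (run from inside the yolov5/ directory)
python export.py --weights yolov5s.pt --include onnx --imgsz 640
```

Note that Triton generally expects a model repository layout such as a versioned folder containing model.onnx (for example 1/model.onnx) plus an optional config.pbtxt; check the TIR documentation for the exact layout it expects before uploading.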
Creating the Model
The model has to be created, along with its storage, before deployment. Models in TIR are containers for sharing and using model weights; at the backend, the weights and other files are stored in E2E Object Storage (EOS) buckets. Navigate to the model storage tab in the inference section of the TIR platform and click the Create Model button. Under model types, select Triton. After creating the model, you will be able to see its credentials.
We have now created a model with a Triton backend and its object storage. The next step is to upload the weights. You can access E2E Object Storage (EOS) using any S3-compatible CLI or SDK; we recommend the MinIO CLI.
We will be using the MinIO CLI (mc) here. To set it up and upload the model, run the following commands; you can also refer to the documentation for detailed steps.
Here, FOLDER_NAME is the path to the folder containing the ONNX file, and yolo-v5/yolo-v5-36cea6 is the name of the model and bucket storage I just created. Feel free to change the names as required. The model will be uploaded to the bucket storage.
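A minimal sketch of the setup, assuming the mc binary is already installed; the alias name, access key, secret key, and EOS endpoint are placeholders you should replace with the values shown on your model's credentials page:

```bash
# Register the E2E Object Storage endpoint with the MinIO client
# (older mc releases use "mc config host add" instead of "mc alias set")
mc alias set yolo-v5 <EOS_ENDPOINT_URL> <ACCESS_KEY> <SECRET_KEY>

# Copy the exported ONNX model folder into the model's bucket
mc cp --recursive FOLDER_NAME yolo-v5/yolo-v5-36cea6/
```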
Deploying the Endpoint
Before moving to endpoints, create an authorization token in the API Tokens section. To create a model endpoint for our object detection model, go to Model Endpoints and click the Create Endpoint button. Select the GPU configuration your model requires and create the inference endpoint.
You have successfully deployed the YOLOv5 model. Now, let's see how we can get inference results.
Model Inference
We need to install the Triton client libraries for Python. Make sure you install the same version as the Triton release used for the model backend in the cloud; here I am using version 2.31.0.
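For example (the version is pinned to match the server; numpy and opencv-python are added here because the preprocessing step below assumes them):

```bash
pip install "tritonclient[http]==2.31.0" numpy opencv-python
```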
Create a Triton client. This client can be used to send requests to the server and receive responses. The tritonclient.http module is part of the Triton Inference Server client library, which provides a Python API for interacting with Triton servers.
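A minimal sketch, assuming the endpoint URL and the API token from the previous step; the authorization header format shown is an assumption, so check the sample code on your endpoint page for the exact headers TIR expects:

```python
import tritonclient.http as httpclient

# Host portion of your endpoint URL, without the https:// scheme (placeholder)
url = "<your-endpoint-url>"

# Auth headers carrying the API token created earlier (assumed format; verify against TIR docs)
headers = {"Authorization": "Bearer <your-api-token>"}

# Create an HTTP client for the Triton server; ssl=True since the endpoint is served over HTTPS
triton_client = httpclient.InferenceServerClient(url=url, ssl=True)
```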
If you face any SSL issues, you can try creating the client with the following snippet instead.
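One common workaround (a sketch; it disables certificate verification, so use it only for testing) is to pass an unverified SSL context to the client:

```python
import gevent.ssl
import tritonclient.http as httpclient

triton_client = httpclient.InferenceServerClient(
    url=url,
    ssl=True,
    # Skip certificate verification; acceptable for quick local tests only
    ssl_context_factory=gevent.ssl._create_unverified_context,
)
```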
Replace the url with the endpoint URL from your model endpoints page.
Let’s test the endpoint using a sample image. Preprocess the image and convert it to a format expected by the YOLO model. For YOLOv5, the input size is typically 640x640. You might need to adjust this based on your model configuration.
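A sketch of the preprocessing with OpenCV, assuming a local file named sample.jpg and a 640x640 model input:

```python
import cv2
import numpy as np

# Load the test image and convert BGR (OpenCV default) to RGB
image = cv2.imread("sample.jpg")
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

# Resize to the resolution the model expects
image = cv2.resize(image, (640, 640))

# Scale pixel values to [0, 1], reorder to CHW, and add a batch dimension: (1, 3, 640, 640)
input_data = image.astype(np.float32) / 255.0
input_data = np.transpose(input_data, (2, 0, 1))
input_data = np.expand_dims(input_data, axis=0)
```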
Now set the model name and create an InferInput object. 'input__0' is the name of the input for this YOLOv5 model.
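For example (the model name "yolo-v5" is an assumption based on the model created earlier, and the input name and datatype must match your model's configuration):

```python
model_name = "yolo-v5"  # assumed; use the name of the model you created in TIR

# Describe the input tensor: name, shape, and datatype must match the model configuration
infer_input = httpclient.InferInput("input__0", list(input_data.shape), "FP32")
```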
Set the input data for the InferInput object.
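Continuing the sketch above:

```python
# Attach the preprocessed image to the input object (sent as binary data)
infer_input.set_data_from_numpy(input_data, binary_data=True)
```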
Now, create an InferRequestedOutput object for each output of the model. 'output__0' is the name of the output for the YOLOv5 model.
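For example, again assuming the output name from the text:

```python
# Request the model's detection output by name
infer_output = httpclient.InferRequestedOutput("output__0")
```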
Finally, send the image as an HTTP request to the server and get the response from the model.
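A sketch that ties the pieces together; the headers argument reuses the assumed auth headers created with the client:

```python
# Send the inference request over HTTP and read back the detections as a NumPy array
response = triton_client.infer(
    model_name,
    inputs=[infer_input],
    outputs=[infer_output],
    headers=headers,  # assumed auth header format; check your endpoint's sample code
)

detections = response.as_numpy("output__0")
print(detections.shape)  # raw YOLOv5 detections, ready for post-processing (NMS, scaling, etc.)
```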
Wrapping Up
You have successfully deployed an object detection model on the NVIDIA Triton server using E2E Cloud. We encourage you to experiment with other models too. Ideally, you should use Docker, a popular platform that packages applications into containers, which makes it easier to distribute and deploy applications, including AI models. We hope you enjoyed this tutorial and found it useful for your projects. Thank you for reading, and happy coding!