Deploy Computer Vision Models with Triton Inference Server

There are a lot of Machine Learning courses, and most of us are pretty good at modeling and at improving accuracy or other metrics. But many of us run into trouble outside of Jupyter or VS Code. There is a gap between our models and a finished business solution, and it doesn't matter how good our models are if they don't create value for the business. Besides, it is satisfying to have a fully working solution.

That's why deployment is an important topic. As a Computer Vision engineer, I would like to show an example of deployment and give you a template, so you know where to start, or at least see one way it can be done.

The plan

We are going to deploy a model with Triton Server. As an example model I chose YOLOv5, converted to TensorRT. For hardware I used an Nvidia Jetson Nano. The pipeline runs on a test video. The example should transfer easily to other hardware and models. Note that I would recommend YOLOv8 for the model, as it is the newer and better successor to YOLOv5, though you might have trouble installing it on the old Jetson Nano.

A few words about the key elements:

YOLOv5/YOLOv8 - fast and accurate detectors; v8 also has segmentation and classification models.
TensorRT - model optimization for Nvidia GPUs; it can make a model 2–5 times faster and noticeably smaller at inference time (it needs less VRAM).
Triton Server - inference serving software. It's like a backend where you run your models and process HTTP or gRPC requests with images.
Nvidia Jetson - small edge computer with Nvidia GPU.

Prepare Triton Server

First, you need to install everything required for inference. The process differs from platform to platform, but here is what we need:

Nvidia libs for deep learning (nvidia-drivers, cuda toolkit, cudnn)
PyTorch
Triton server
YOLOv5

Second, train a model on a custom dataset to get the weights as a .pt file. You can also download pre-trained weights for a test run.

The next step is to optimize your model with TensorRT. Here is an example of how to do that:

python3 export.py --weights yolov5s.pt --include engine --imgsz 640 640 --device 0 # --half

With --half you would use 16-bit precision instead of 32-bit. Make sure to run this step on the GPU you will use for inference, because the resulting TensorRT engine is optimized for the exact GPU model it was built on.

We are finally getting to Triton Server. Let's create a folder structure like this one:

-> model_repository
---> yolov5
-----> config.pbtxt
-----> 1
-------> model.plan
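
A layout like the one above could also be created with a few lines of Python. This is only a sketch: the function name build_model_repository and its arguments are my own placeholders, and it covers just the folders and the engine file, so config.pbtxt still has to be written by hand into the model folder.

```python
import shutil
from pathlib import Path


def build_model_repository(engine_path, repo_root, model_name="yolov5", version="1"):
    """Copy an exported engine into <repo_root>/<model_name>/<version>/model.plan."""
    version_dir = Path(repo_root) / model_name / version
    version_dir.mkdir(parents=True, exist_ok=True)

    # The exported .engine file is stored under the name model.plan.
    plan_path = version_dir / "model.plan"
    shutil.copyfile(engine_path, plan_path)

    # config.pbtxt belongs one level up, next to the version folder.
    return plan_path
```

For example, build_model_repository("yolov5s.engine", "model_repository") would produce model_repository/yolov5/1/model.plan, assuming the export step wrote yolov5s.engine to the current folder.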

In this layout, yolov5 is the model name, 1 is the model version, model.plan is our exported model (just rename the .engine file to model.plan), and config.pbtxt is the config file that tells Triton how to work with your model. Here is an example of the config:

name: "yolov5"
platform: "tensorrt_plan"
max_batch_size: 1
input [
  {
    name: "images"
    data_type: TYPE_FP32
    dims: [ 3, 640, 640 ]
  }
]
output [
  {
    name: "output0"
    data_type: TYPE_FP32
    dims: [ 25200, 85 ]
  }
]

As we use TensorRT weights, we choose platform: "tensorrt_plan".

As we did not export with half precision, we use data_type: TYPE_FP32 and not TYPE_FP16.

We use input dims: [ 3, 640, 640 ] because our input has 3 channels (RGB) and the image size is 640x640. The batch dimension is not listed in dims, since max_batch_size already covers it.

For the output dims we use [ 25200, 85 ], which matches the YOLO output shape.

Our model outputs 80 class scores + 4 box values + 1 object confidence for each anchor, and there are 25200 anchors per image. So if you have, for example, 4 classes to detect, you should change 85 to 9.
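
To see where 25200 and 85 come from, here is the arithmetic as a small sketch. It assumes the standard YOLOv5 detection head with three output scales (strides 8, 16 and 32) and three anchors per grid cell; the function name is made up for illustration.

```python
def yolov5_output_dims(img_size=640, num_classes=80, strides=(8, 16, 32), anchors_per_cell=3):
    """Return [number of predictions, values per prediction] for a square input."""
    # Each scale has a grid of (img_size / stride)^2 cells with 3 anchors per cell.
    predictions = sum(anchors_per_cell * (img_size // s) ** 2 for s in strides)
    # 4 box values + 1 object confidence + one score per class.
    values = 4 + 1 + num_classes
    return [predictions, values]


print(yolov5_output_dims())               # [25200, 85]
print(yolov5_output_dims(num_classes=4))  # [25200, 9]
```

Note that changing the image size in the export step changes the first number as well, so both values in the output dims depend on how the model was trained and exported.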

The output looks like:

x center, y center, width, height, object conf, class_1_conf, class_2_conf...
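
Reading one such row could look like the sketch below. The helper name and the sample numbers are made up. Multiplying the object confidence by the class confidence is how YOLOv5's own post-processing scores a detection, as far as I know, but the client in the repository may do it differently.

```python
def decode_row(row):
    """Turn one raw prediction row into (box_xyxy, score, class_id)."""
    cx, cy, w, h, obj_conf = row[:5]
    class_confs = row[5:]

    # Pick the most likely class and combine it with the object confidence.
    class_id = max(range(len(class_confs)), key=class_confs.__getitem__)
    score = obj_conf * class_confs[class_id]

    # Convert center/size to corner coordinates (still in 640x640 input pixels).
    box = (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
    return box, score, class_id


# Made-up row for a 2-class model: 4 box values, object conf, 2 class confs.
row = [320.0, 240.0, 100.0, 50.0, 0.9, 0.1, 0.8]
box, score, class_id = decode_row(row)
print(box, class_id)  # (270.0, 215.0, 370.0, 265.0) 1
```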

With the config being ready, we should be able to start a Triton server with a command like this one:

/home/argo/installation_triton/bin/tritonserver --model-repository=/home/argo/general_triton_yolo_pipeline/model_repository/ --backend-directory=/home/argo/installation_triton/backends

Now we are ready to send our images to Triton server and get predictions.
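
Before sending images, it can help to check that the server and the model are actually up. The sketch below uses the HTTP readiness endpoints that Triton exposes, and assumes the server runs on the same machine with the default HTTP port 8000 (the start command above does not change it). The function names are my own.

```python
import urllib.request


def ready_url(base_url, model_name=None):
    """Build the readiness URL for the server, or for one model if a name is given."""
    base = base_url.rstrip("/")
    if model_name is None:
        return f"{base}/v2/health/ready"
    return f"{base}/v2/models/{model_name}/ready"


def is_ready(base_url="http://localhost:8000", model_name=None, timeout=2.0):
    """Return True if the endpoint answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(ready_url(base_url, model_name), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused, timeout, or an HTTP error status: treat as not ready.
        return False
```

Calling is_ready() checks the server as a whole, and is_ready(model_name="yolov5") checks that our model was loaded from the repository.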

Prepare gRPC client

When a model runs in Triton Server, we have to do all the pre-processing and post-processing ourselves. Here are the key steps:

Pre-processing:

Letterbox (resize)
RGB format
Normalization
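
The geometry of the letterbox step can be sketched in plain Python. This is an illustration with a made-up function name, not the code from the repository: it only computes the resize scale and the padding. The actual resize and padding would be done with an image library such as OpenCV, followed by the BGR to RGB conversion and scaling the pixel values to the 0..1 range.

```python
def letterbox_params(src_w, src_h, dst=640):
    """Compute the resize scale, the resized size and the padding per side."""
    # Use the smaller ratio so the whole frame fits inside the dst x dst square.
    scale = min(dst / src_w, dst / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    # The remaining space is split evenly between the two opposite sides.
    pad_x, pad_y = (dst - new_w) / 2, (dst - new_h) / 2
    return scale, (new_w, new_h), (pad_x, pad_y)


scale, size, pad = letterbox_params(1920, 1080)
print(size, pad)  # (640, 360) (0.0, 140.0)
```

So a 1920x1080 frame is shrunk to 640x360 and gets 140 pixels of padding above and below. The scale and padding are worth keeping, because post-processing needs them to map boxes back to the original frame.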

Post-processing:

Transform bbox values
Non-max suppression
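
Both post-processing steps can be sketched in plain Python. This is a simplified, class-agnostic version for illustration (YOLOv5's own implementation works per class and on tensors). The function names and the 0.45 IoU threshold are my assumptions, and scale, pad_x and pad_y are the values used when the frame was letterboxed.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh=0.45):
    """Greedy non-max suppression; returns the indices of the boxes to keep."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep


def to_original(box, scale, pad_x, pad_y):
    """Undo the letterbox: remove the padding, then divide by the resize scale."""
    x1, y1, x2, y2 = box
    return ((x1 - pad_x) / scale, (y1 - pad_y) / scale,
            (x2 - pad_x) / scale, (y2 - pad_y) / scale)


boxes = [(100, 100, 200, 200), (105, 105, 205, 205), (300, 300, 400, 400)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first one heavily and has a lower score, so it is dropped, while the third box does not overlap anything and stays.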

Besides that, we need to open a connection to the Triton server and send our batch for inference. You can find the full client example here. Keep an eye on the configs in the init function, which should match config.pbtxt.

With all of that in place, we can send an image and get the prediction back in a readable form.
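
For orientation, a single request with the tritonclient package (installed with pip install tritonclient[grpc]) could look like the sketch below. It is not the client from the repository. The model, input and output names are taken from config.pbtxt above, while the function name and the URL are my assumptions (8001 is Triton's default gRPC port).

```python
MODEL_NAME = "yolov5"
INPUT_NAME = "images"
OUTPUT_NAME = "output0"
INPUT_SHAPE = [1, 3, 640, 640]  # batch of 1 + dims from config.pbtxt


def infer_once(batch, url="localhost:8001"):
    """Send one pre-processed float32 batch to Triton and return the raw output."""
    # Imported here so the module loads even where tritonclient is not installed.
    import tritonclient.grpc as grpcclient

    client = grpcclient.InferenceServerClient(url=url)
    if not client.is_model_ready(MODEL_NAME):
        raise RuntimeError(f"Model {MODEL_NAME} is not ready on {url}")

    # Names, shape and data type must match config.pbtxt.
    infer_input = grpcclient.InferInput(INPUT_NAME, INPUT_SHAPE, "FP32")
    infer_input.set_data_from_numpy(batch)
    requested = grpcclient.InferRequestedOutput(OUTPUT_NAME)

    result = client.infer(model_name=MODEL_NAME, inputs=[infer_input], outputs=[requested])
    return result.as_numpy(OUTPUT_NAME)
```

Here batch is expected to be a NumPy float32 array of shape (1, 3, 640, 640), and with the config above the returned array should have shape (1, 25200, 85).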

Main Pipeline

All we need now is to create a higher-level pipeline with these functions:

Read the frame
Run detection
Save the image if something was detected

Here is the code for main.py:

import cv2
from pathlib import Path

from src.yolov5_grpc import Yolov5_grpc
from src.utils import fps_counter


class Video_stream:
    def __init__(self, src):
        self.cap = cv2.VideoCapture(src)

    def read(self):
        # Return the next frame, or None when the video ends or the read fails.
        ret, frame = self.cap.read()
        return frame if ret else None


class Pipeline:
    def __init__(
        self, src: str, detector_thres: float = 0.5, save_images: bool = False
    ):
        self.detector_thres = detector_thres
        self.save_images = save_images
        self.root_path = Path(__file__).parent.absolute()
        self.images_path_save = self.root_path / "images"

        self.camera = Video_stream(src)
        self.detector = Yolov5_grpc(conf_thresh=detector_thres)
        self.create_images_folder()
        self.idx = 0
        self.running = True

    def create_images_folder(self):
        Path(self.images_path_save).mkdir(parents=True, exist_ok=True)

    def save_output(self, pred_frame):
        output_path = (self.images_path_save / f"image_{self.idx}").with_suffix(".jpeg")
        cv2.imwrite(str(output_path), pred_frame)

    @fps_counter
    def _runner(self):
        frame = self.camera.read()
        if frame is None:
            self.running = False
            return

        boxes, pred_frame, _ = self.detector.get_boxes_debug(frame)
        if boxes and self.save_images:
            self.save_output(pred_frame)

        self.idx += 1

    def run(self):
        while self.running:
            self._runner()


def main():
    src = "test_vid.mp4"
    detector_thres = 0.7
    save_images = True

    Pipeline(src, detector_thres, save_images).run()


if __name__ == "__main__":
    main()

The Video_stream class reads frames. The Pipeline class creates a folder for the saved images, grabs a frame, runs detection and, if something was detected, saves the frame. With @fps_counter we can measure the speed of our system.

In the main function, you can change the detector's threshold and choose another video to run your test on.
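
The fps_counter decorator is imported from src.utils and not shown here. A minimal sketch of what such a decorator might look like is below; this is my guess at the idea, not the implementation from the repository.

```python
import functools
import time


def fps_counter(func):
    """Print how many calls per second the wrapped function could sustain."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            print(f"{func.__name__}: {1.0 / elapsed:.1f} FPS")
        return result
    return wrapper
```

Because it wraps _runner, the measured time covers reading the frame, the round trip to Triton, and saving the image, so it reflects the speed of the whole pipeline and not only of the model.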

What do we get at the end?

We have created a simple pipeline for running YOLO models with TensorRT in Triton Server. It is a good place to start when you need to deploy a computer vision model in the real world and want to be able to scale it easily.

You can have several clients sending image requests to one Triton server. You can also run several models in one Triton server, either to serve several clients faster or because you have different models for different tasks. Finally, you can find everything I shared here in the repository.

I highly recommend diving deeper into Triton Server, TensorRT and YOLOv8. You can also read about DeepStream as the next step in your deployment.

Update Dec 2023: Take a look at the repo I shared; everything has been updated to use YOLOv8.