There are plenty of machine learning courses, and most of us are pretty good at modeling and at improving accuracy or other metrics. But many of us run into trouble outside Jupyter or VS Code. There is a gap between our models and a finished business solution, and it doesn't matter how good a model is if it doesn't create value for the business. Besides, it is satisfying to have a fully working solution.
That's why deployment is an important topic. As a computer vision engineer, I would like to show an example of deployment and give you a template, so you know where to start, or can at least see one way it can be done.
We are going to discuss deployment with Triton Server. As the example model, I chose YOLOv5 converted to TensorRT. For hardware, I used an Nvidia Jetson Nano. The pipeline runs on a test video. This example should transfer easily to other hardware and models. Note that I recommend YOLOv8 for the model, as it is a newer and better version of YOLOv5, though you might have trouble installing it on the old Jetson Nano.
A few words about the key elements:
First, install everything needed for inference. The process differs between platforms, but here is what we need:
Second, train a model on a custom dataset to get the weights as a .pt file. You can also download pre-trained weights for a test run.
The next step is to optimize your model with TensorRT. Here is an example of how to do that:
python3 export.py --weights yolov5s.pt --include engine --imgsz 640 640 --device 0 # --half
With --half you would use 16-bit precision instead of 32-bit. Make sure to run this step on the GPU you will use for inference, because the optimized engine is tied to the exact GPU model.
We are finally getting to Triton Server. Let's create a folder structure like this one:
-> model_repository
---> yolov5
-----> config.pbtxt
-----> 1
-------> model.plan
So, yolov5 is the model name, 1 is the model version, model.plan is our exported model (just rename .engine to .plan), and config.pbtxt is the config file that tells Triton how to work with your model. Here is an example of the config:
name: "yolov5"
platform: "tensorrt_plan"
max_batch_size: 1
input [
  {
    name: "images"
    data_type: TYPE_FP32
    dims: [ 3, 640, 640 ]
  }
]
output [
  {
    name: "output0"
    data_type: TYPE_FP32
    dims: [ 25200, 85 ]
  }
]
As we use TensorRT weights, we choose platform: "tensorrt_plan".
As we did not export with half precision, we use data_type: TYPE_FP32 and not TYPE_FP16.
We use input dims: [ 3, 640, 640 ], as our input has 3 channels (RGB) and the image size is 640x640.
For the output dims we use [ 25200, 85 ], according to the YOLO output shape.
Our model outputs 80 class scores + 4 box values + 1 object confidence for each anchor, and there are 25200 anchors per image. So if you have, for example, 4 classes to detect, you should change 85 to 9.
The output looks like:
x center, y center, width, height, object conf, class_1_conf, class_2_conf...
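To make the numbers in the config less magical, here is a minimal sketch that derives the output dims and lays out the model repository described above. It assumes YOLOv5's default of three detection strides (8, 16, 32) with three anchors per grid cell and a square input size divisible by 32. The helper names (yolo_output_dims, build_config, prepare_repository) are hypothetical and are not part of the repository or of Triton. Switching both tensors to TYPE_FP16 for a half-precision export is also an assumption, so check the actual binding types of your engine.

```python
import shutil
from pathlib import Path

# Assumed YOLOv5 defaults: three detection strides, three anchors per grid cell.
STRIDES = (8, 16, 32)
ANCHORS_PER_CELL = 3


def yolo_output_dims(imgsz: int = 640, num_classes: int = 80) -> tuple:
    """Return (rows, columns) of the raw output for a square input of size imgsz."""
    rows = sum((imgsz // stride) ** 2 for stride in STRIDES) * ANCHORS_PER_CELL
    cols = 4 + 1 + num_classes  # box values + object confidence + class scores
    return rows, cols


def build_config(name: str = "yolov5", imgsz: int = 640,
                 num_classes: int = 80, half: bool = False) -> str:
    """Render config.pbtxt text in the same shape as the example above."""
    # Assumption: a --half export exposes FP16 inputs and outputs.
    dtype = "TYPE_FP16" if half else "TYPE_FP32"
    rows, cols = yolo_output_dims(imgsz, num_classes)
    return "\n".join([
        f'name: "{name}"',
        'platform: "tensorrt_plan"',
        "max_batch_size: 1",
        "input [",
        "  {",
        '    name: "images"',
        f"    data_type: {dtype}",
        f"    dims: [ 3, {imgsz}, {imgsz} ]",
        "  }",
        "]",
        "output [",
        "  {",
        '    name: "output0"',
        f"    data_type: {dtype}",
        f"    dims: [ {rows}, {cols} ]",
        "  }",
        "]",
        "",
    ])


def prepare_repository(engine_path, repo_root, name: str = "yolov5",
                       version: int = 1, **config_kwargs) -> Path:
    """Create <repo_root>/<name>/<version>/model.plan and <name>/config.pbtxt."""
    model_dir = Path(repo_root) / name
    version_dir = model_dir / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)
    # Copying under a new name is the "rename .engine to .plan" step.
    shutil.copyfile(engine_path, version_dir / "model.plan")
    (model_dir / "config.pbtxt").write_text(build_config(name=name, **config_kwargs))
    return model_dir
```

For a 640x640 input the grids are 80x80, 40x40 and 20x20, so 3 x (6400 + 1600 + 400) = 25200 rows, which matches the config. A call such as prepare_repository("yolov5s.engine", "model_repository", num_classes=4) would write dims of [ 25200, 9 ].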
With the config being ready, we should be able to start a Triton server with a command like this one:
/home/argo/installation_triton/bin/tritonserver --model-repository=/home/argo/general_triton_yolo_pipeline/model_repository/ --backend-directory=/home/argo/installation_triton/backends
Now we are ready to send our images to Triton server and get predictions.
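Before sending images, it can help to confirm that the server has actually finished loading. The sketch below polls the readiness endpoint of Triton's HTTP API using only the standard library. It assumes the server runs on the same machine with the default HTTP port 8000. The function names are hypothetical and are not taken from the repository.

```python
import time
import urllib.request


def server_ready(base_url: str = "http://localhost:8000", timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/v2/health/ready answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v2/health/ready", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError and HTTPError are subclasses of OSError: treat any
        # connection problem or error status as "not ready yet".
        return False


def wait_until_ready(base_url: str = "http://localhost:8000",
                     attempts: int = 30, delay: float = 1.0) -> bool:
    """Poll the server until it reports ready or the attempts run out."""
    for _ in range(attempts):
        if server_ready(base_url):
            return True
        time.sleep(delay)
    return False
```

Calling wait_until_ready() at the start of the client avoids a confusing connection error when the pipeline is launched right after the server.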
When a model runs in Triton Server, we need to do all pre-processing and post-processing ourselves. Here are the key things:
Pre-processing:
Post-processing:
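Both steps can be sketched in plain NumPy as follows. This is a simplified illustration, not the code from the repository. Pre-processing uses a nearest-neighbour resize as a stand-in for cv2.resize (YOLOv5's own pipeline uses letterboxing to keep the aspect ratio), and post-processing runs a class-agnostic NMS for brevity. The function names and the default thresholds are assumptions.

```python
import numpy as np


def preprocess(frame_bgr: np.ndarray, imgsz: int = 640) -> np.ndarray:
    """BGR HxWx3 uint8 frame -> float32 batch of shape (1, 3, imgsz, imgsz)."""
    h, w = frame_bgr.shape[:2]
    # Nearest-neighbour resize by index lookup (stand-in for cv2.resize).
    rows = np.arange(imgsz) * h // imgsz
    cols = np.arange(imgsz) * w // imgsz
    resized = frame_bgr[rows][:, cols]
    rgb = resized[:, :, ::-1]                  # BGR -> RGB
    chw = rgb.transpose(2, 0, 1)               # HWC -> CHW
    scaled = chw.astype(np.float32) / 255.0    # 0..255 -> 0..1
    return np.ascontiguousarray(scaled[None])  # add the batch dimension


def xywh_to_xyxy(boxes: np.ndarray) -> np.ndarray:
    """Convert (x center, y center, width, height) to corner coordinates."""
    out = np.empty_like(boxes)
    out[:, 0] = boxes[:, 0] - boxes[:, 2] / 2
    out[:, 1] = boxes[:, 1] - boxes[:, 3] / 2
    out[:, 2] = boxes[:, 0] + boxes[:, 2] / 2
    out[:, 3] = boxes[:, 1] + boxes[:, 3] / 2
    return out


def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.45) -> list:
    """Greedy non-maximum suppression; returns indices of the kept boxes."""
    order = scores.argsort()[::-1]
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[best] + areas[rest] - inter + 1e-9)
        order = rest[iou <= iou_thresh]  # drop boxes overlapping the best one
    return keep


def postprocess(output: np.ndarray, conf_thresh: float = 0.5,
                iou_thresh: float = 0.45) -> list:
    """Raw (25200, 85)-style output -> list of (x1, y1, x2, y2, score, class_id)."""
    pred = output.reshape(-1, output.shape[-1])
    class_ids = pred[:, 5:].argmax(axis=1)
    # Final score = object confidence * confidence of the best class.
    scores = pred[:, 4] * pred[np.arange(len(pred)), 5 + class_ids]
    mask = scores > conf_thresh
    boxes = xywh_to_xyxy(pred[mask, :4])
    scores, class_ids = scores[mask], class_ids[mask]
    return [(*boxes[i].tolist(), float(scores[i]), int(class_ids[i]))
            for i in nms(boxes, scores, iou_thresh)]
```

The boxes come back in the coordinates of the 640x640 network input, so they still need to be scaled back to the original frame size before drawing.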
Besides that, we need to create a connection with the Triton server and send our batch for inference. You can find a full client example here. Keep an eye on the init function with its configs, which should match config.pbtxt.
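As a rough idea of what such a client does, here is a minimal sketch built on the tritonclient gRPC package. It is not the client from the repository. The constants mirror the config.pbtxt shown earlier, and the URL assumes the server runs on the same machine with Triton's default gRPC port 8001. The helper names check_batch and infer are hypothetical.

```python
import numpy as np

# These values mirror config.pbtxt and must be kept in sync with it.
MODEL_NAME = "yolov5"
INPUT_NAME = "images"
OUTPUT_NAME = "output0"
INPUT_SHAPE = (1, 3, 640, 640)  # batch of 1 (max_batch_size) + dims [3, 640, 640]
URL = "localhost:8001"          # assumption: default gRPC port, same machine


def check_batch(batch: np.ndarray) -> None:
    """Fail early if the batch does not match what config.pbtxt declares."""
    if batch.shape != INPUT_SHAPE:
        raise ValueError(f"expected shape {INPUT_SHAPE}, got {batch.shape}")
    if batch.dtype != np.float32:
        raise ValueError(f"expected float32 (TYPE_FP32), got {batch.dtype}")


def infer(batch: np.ndarray, url: str = URL) -> np.ndarray:
    """Send one pre-processed batch to Triton and return the raw output array."""
    # Imported here so the rest of the module loads without tritonclient
    # installed (pip install tritonclient[grpc]).
    import tritonclient.grpc as grpcclient

    check_batch(batch)
    client = grpcclient.InferenceServerClient(url=url)
    if not client.is_model_ready(MODEL_NAME):
        raise RuntimeError(f"model {MODEL_NAME!r} is not ready on {url}")

    infer_input = grpcclient.InferInput(INPUT_NAME, list(batch.shape), "FP32")
    infer_input.set_data_from_numpy(batch)
    requested = grpcclient.InferRequestedOutput(OUTPUT_NAME)

    result = client.infer(model_name=MODEL_NAME, inputs=[infer_input],
                          outputs=[requested])
    return result.as_numpy(OUTPUT_NAME)
```

In a real pipeline you would create the client once and reuse it for every frame. The sketch opens a connection per call only to stay short.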
With all of that in place, we can now send an image and get the prediction in a readable form.
All we need now is to create a higher-level pipeline with these functions:
Here is the code for main.py:
import cv2
from pathlib import Path

from src.yolov5_grpc import Yolov5_grpc
from src.utils import fps_counter


class Video_stream:
    def __init__(self, src):
        self.cap = cv2.VideoCapture(src)

    def read(self):
        # Return the next frame, or None when the video ends or a read fails.
        ret, frame = self.cap.read()
        if ret:
            return frame
        return None


class Pipeline:
    def __init__(
        self, src: str, detector_thres: float = 0.5, save_images: bool = False
    ):
        self.detector_thres = detector_thres
        self.save_images = save_images
        self.root_path = Path(__file__).parent.absolute()
        self.images_path_save = self.root_path / "images"
        self.camera = Video_stream(src)
        self.detector = Yolov5_grpc(conf_thresh=detector_thres)
        self.create_images_folder()
        self.idx = 0
        self.running = True

    def create_images_folder(self):
        Path(self.images_path_save).mkdir(parents=True, exist_ok=True)

    def save_output(self, pred_frame):
        output_path = (self.images_path_save / f"image_{self.idx}").with_suffix(".jpeg")
        cv2.imwrite(str(output_path), pred_frame)

    @fps_counter
    def _runner(self):
        # Process a single frame: read, detect, optionally save.
        frame = self.camera.read()
        if frame is None:
            # No more frames: stop the main loop.
            self.running = False
            return
        boxes, pred_frame, _ = self.detector.get_boxes_debug(frame)
        if boxes and self.save_images:
            self.save_output(pred_frame)
        self.idx += 1

    def run(self):
        while self.running:
            self._runner()


def main():
    src = "test_vid.mp4"
    detector_thres = 0.7
    save_images = True
    Pipeline(src, detector_thres, save_images).run()


if __name__ == "__main__":
    main()
The Video_stream class reads frames. The Pipeline class creates a folder for the saved images, grabs a frame, runs detection, and finally saves the frame. With @fps_counter we can measure the speed of our system.
In the main function, you can change the detector's threshold and choose another video to run your test on.
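The fps_counter decorator is imported from src.utils and is not shown in this article. As an illustration only, a decorator with that role could look like the sketch below. The reporting interval and the printed format are assumptions, and the real implementation in the repository may differ.

```python
import time
from functools import wraps

REPORT_EVERY = 30  # assumption: print a running average every 30 calls


def fps_counter(func):
    """Time each call of func and periodically print the average calls per second."""
    calls = 0
    elapsed = 0.0

    @wraps(func)
    def wrapper(*args, **kwargs):
        nonlocal calls, elapsed
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed += time.perf_counter() - start
        calls += 1
        if calls % REPORT_EVERY == 0 and elapsed > 0:
            print(f"{func.__name__}: {calls / elapsed:.1f} FPS over {calls} calls")
        return result

    return wrapper
```

Because it wraps the per-frame method, the number it reports covers the whole loop body: reading the frame, pre-processing, the round trip to Triton, post-processing, and saving.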
We have created a simple pipeline for running YOLO models with TensorRT in Triton Server. It is a place to start when you need to deploy a computer vision model in the real world and scale it easily.
You can have several clients, all sending requests with images to one Triton server. You can also run several models in Triton Server, either to serve several clients faster or because you have different models for different tasks. Finally, you can find everything I shared in the repository.
I highly recommend diving deeper into Triton Server, TensorRT, and YOLOv8. You can also read about DeepStream as the next step in your deployment.
Update Dec 2023: Take a look at the repo I shared. Everything is updated to use YOLOv8.