There are a lot of Machine Learning courses, and we are pretty good at modeling and improving our accuracy or other metrics. But many of us run into trouble outside Jupyter or VS Code: there is a gap between our models and a finished business solution. And it doesn't matter how good our models are if they don't create value for the business. Finally, it is satisfying to have a fully working solution. That's why the topic of deployment is important. As a Computer Vision engineer, I would like to show an example of deployment and give you a template, so you know where to start, or at least so you can see one way it can be done.

## The plan

We are going to discuss Triton Server deployment. As a model example, I chose YOLOv5, converted to TensorRT. As hardware, I used an Nvidia Jetson Nano. The pipeline runs on a test video. This example should transfer easily to other hardware and models. Note: for the model, I recommend using YOLOv8, as it's a newer and better version of YOLOv5, though you might have issues installing it on the old Jetson Nano.

A couple of words about the key elements:

- YOLOv5/YOLOv8 - a fast and accurate detector; v8 also has segmentation and classification models.
- TensorRT - model optimization for Nvidia GPUs; it makes the model 2-5 times faster at inference and noticeably smaller (it needs less VRAM).
- Triton Server - inference serving software. It's like a backend where you run your models and process HTTP or gRPC requests with images.
- Nvidia Jetson - a small edge computer with an Nvidia GPU.

## Prepare Triton Server

First, you need to install everything needed for inference. The process differs between platforms, but here is what we need:

- Nvidia libs for deep learning (nvidia-drivers, CUDA toolkit, cuDNN)
- PyTorch
- Triton Server
- YOLOv5

Second, train a model on a custom dataset and get the weights as a .pt file. You can also download pre-trained weights for a test run.

The next step is to optimize your model with TensorRT. Here is an example of how to do that:

```bash
python3 export.py --weights yolov5s.pt --include engine --imgsz 640 640 --device 0  # --half
```

With `--half` you would use 16-bit precision instead of 32-bit. Make sure to do this step on your inferencing GPU, as one of the optimizations is tied to the exact GPU model.

We are finally getting to Triton Server. Let's create a folder structure like this one:

```
model_repository
└── yolov5
    ├── config.pbtxt
    └── 1
        └── model.plan
```

So, `yolov5` is the model name, `1` is the model version, `model.plan` is our exported model (just rename .engine to .plan), and finally `config.pbtxt` is a config file so Triton knows how to work with your model. Here is an example of the config:

```
name: "yolov5"
platform: "tensorrt_plan"
max_batch_size: 1
input [
  {
    name: "images"
    data_type: TYPE_FP32
    dims: [ 3, 640, 640 ]
  }
]
output [
  {
    name: "output0"
    data_type: TYPE_FP32
    dims: [ 25200, 85 ]
  }
]
```

- As we use TensorRT weights, we choose `platform: "tensorrt_plan"`.
- As we did not export with half precision, we use `data_type: TYPE_FP32` and not `TYPE_FP16`.
- We use input `dims: [ 3, 640, 640 ]` as our input has 3 channels (RGB) and the image size is 640x640.
- As the output dims we use `[ 25200, 85 ]`, according to the YOLO output shape. Our model has 80 class + 4 box + 1 object confidence outputs at each anchor, and there are 25200 anchors per image, so if you have, for example, 4 classes to detect, you should change 85 to 9. The output looks like: x center, y center, width, height, object conf, class_1_conf, class_2_conf...
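To make the output format concrete, here is a minimal NumPy sketch of how one raw (25200, 85) prediction could be filtered and converted to corner boxes before non-max suppression. This is only an illustration: the function name and threshold are placeholders, and the actual post-processing lives in the client code described below.

```python
import numpy as np


def decode_yolov5_output(pred: np.ndarray, conf_thres: float = 0.5):
    """pred has shape (25200, 85): x center, y center, w, h, object conf, 80 class scores."""
    # keep only the anchors whose object confidence passes the threshold
    pred = pred[pred[:, 4] > conf_thres]

    # final score = object confidence * best class score
    class_scores = pred[:, 5:]
    class_ids = class_scores.argmax(axis=1)
    scores = pred[:, 4] * class_scores[np.arange(len(pred)), class_ids]

    # convert (x center, y center, w, h) to (x1, y1, x2, y2) corners
    xy, wh = pred[:, :2], pred[:, 2:4]
    boxes = np.concatenate([xy - wh / 2, xy + wh / 2], axis=1)

    # boxes are still in letterboxed 640x640 coordinates and still need NMS
    return boxes, scores, class_ids
```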
With the config ready, we should be able to start a Triton server with a command like this one:

```bash
/home/argo/installation_triton/bin/tritonserver --model-repository=/home/argo/general_triton_yolo_pipeline/model_repository/ --backend-directory=/home/argo/installation_triton/backends
```

Now we are ready to send our images to the Triton server and get predictions.

## Prepare gRPC client

Using a model in a Triton server, we need to do all the pre-processing and post-processing ourselves. Here are the key things:

Pre-processing:

- Letterbox (resize)
- RGB format
- Normalization

Post-processing:

- Transform bbox values
- Non-max suppression

Besides that, we need to create a connection with the Triton server and send our batch for inference. You can find the full client example here. Keep an eye on the init function with the configs, which should be the same as in config.pbtxt. (A minimal sketch of the pre-processing and gRPC request is also shown at the end of this article.)

With all of that, we can now send an image and get the prediction in a readable way.

## Main Pipeline

All we need now is to create a higher-level pipeline with these functions:

- Read the frame
- Run detection
- Save the image if something was detected

Here is the code for main.py:

```python
import cv2
from pathlib import Path

from src.yolov5_grpc import Yolov5_grpc
from src.utils import fps_counter


class Video_stream:
    def __init__(self, src):
        self.cap = cv2.VideoCapture(src)

    def read(self):
        ret, frame = self.cap.read()
        if ret:
            return frame


class Pipeline:
    def __init__(
        self, src: str, detector_thres: float = 0.5, save_images: bool = False
    ):
        self.detector_thres = detector_thres
        self.save_images = save_images
        self.root_path = Path(__file__).parent.absolute()
        self.images_path_save = self.root_path / "images"

        self.camera = Video_stream(src)
        self.detector = Yolov5_grpc(conf_thresh=detector_thres)

        self.create_images_folder()
        self.idx = 0
        self.running = True

    def create_images_folder(self):
        Path(self.images_path_save).mkdir(parents=True, exist_ok=True)

    def save_output(self, pred_frame):
        output_path = (self.images_path_save / f"image_{self.idx}").with_suffix(".jpeg")
        cv2.imwrite(str(output_path), pred_frame)

    @fps_counter
    def _runner(self):
        # grab a frame; stop the pipeline when the video ends
        frame = self.camera.read()
        if frame is None:
            self.running = False
            return

        # run detection and save the annotated frame if anything was found
        boxes, pred_frame, _ = self.detector.get_boxes_debug(frame)
        if boxes and self.save_images:
            self.save_output(pred_frame)

        self.idx += 1

    def run(self):
        while self.running:
            self._runner()


def main():
    src = "test_vid.mp4"
    detector_thres = 0.7
    save_images = True

    Pipeline(src, detector_thres, save_images).run()


if __name__ == "__main__":
    main()
```

The Video_stream class reads the frame. The Pipeline class creates a folder for the saved images, grabs the frame, runs detection, and finally saves the frame. With @fps_counter we can measure the speed of our system.

In the main function, you can change the detector's threshold and choose another video to run your test on.

## What do we get at the end?

We have created a simple pipeline for running YOLO models with TensorRT in Triton Server. That's a place to start when you need to deploy a computer vision model in the real world and be able to scale it easily. You can have several clients, and they will send requests with images to one Triton server. You can also run several models in Triton Server, either to process several clients faster or because you have different models for different tasks.

Finally, you can find everything I was sharing in the repository.

I highly recommend diving deeper into Triton Server, TensorRT and YOLOv8, and you can also read about DeepStream as the next step in your deployment.
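As a reference for the "Prepare gRPC client" section above, here is a minimal sketch of the pre-processing and the gRPC request. It assumes the server runs locally on Triton's default gRPC port 8001 and uses the tensor names from the config.pbtxt above; the helper names are my own, and the full client in the repository also handles post-processing and drawing.

```python
import cv2
import numpy as np
import tritonclient.grpc as grpcclient


def letterbox(img, new_size=640, color=(114, 114, 114)):
    # resize while keeping the aspect ratio, pad the rest with gray
    h, w = img.shape[:2]
    r = min(new_size / h, new_size / w)
    nh, nw = int(round(h * r)), int(round(w * r))
    canvas = np.full((new_size, new_size, 3), color, dtype=np.uint8)
    top, left = (new_size - nh) // 2, (new_size - nw) // 2
    canvas[top:top + nh, left:left + nw] = cv2.resize(img, (nw, nh))
    return canvas


def preprocess(frame):
    # letterbox -> BGR to RGB -> HWC to CHW -> normalize to [0, 1] -> add batch dim
    img = letterbox(frame)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    return img.transpose(2, 0, 1)[None].astype(np.float32) / 255.0


def infer(frame, url="localhost:8001", model_name="yolov5"):
    client = grpcclient.InferenceServerClient(url=url)
    batch = preprocess(frame)

    # input/output names and dtypes must match config.pbtxt
    infer_input = grpcclient.InferInput("images", list(batch.shape), "FP32")
    infer_input.set_data_from_numpy(batch)
    infer_output = grpcclient.InferRequestedOutput("output0")

    result = client.infer(model_name=model_name, inputs=[infer_input], outputs=[infer_output])
    return result.as_numpy("output0")  # (1, 25200, 85), still needs decoding and NMS
```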
Update Dec 2023: You can take a look at the repo I shared; everything has been updated to use YOLOv8.