
Deploy YOLOv8 on NVIDIA Jetson using TensorRT and DeepStream SDK

This guide explains how to deploy a trained AI model onto the NVIDIA Jetson platform and perform inference using TensorRT and the DeepStream SDK. Here we use TensorRT to maximize inference performance on the Jetson platform.

Prerequisites

  • Ubuntu Host PC (native or VM using VMware Workstation Player)
  • reComputer Jetson or any other NVIDIA Jetson device running JetPack 4.6 or higher

DeepStream Version Corresponding to JetPack Version

For YOLOv8 to work together with DeepStream, we use the DeepStream-Yolo repository, which supports different versions of DeepStream. Make sure to use the version of JetPack that corresponds to your version of DeepStream.

DeepStream Version    JetPack Version
6.2                   5.1.1 / 5.1
6.1.1                 5.0.2
6.1                   5.0.1 DP
6.0.1                 4.6.3 / 4.6.2 / 4.6.1
6.0                   4.6

To verify this wiki, we installed DeepStream SDK 6.2 on a JetPack 5.1.1 system running on a reComputer J4012.

Flash JetPack to Jetson

Now you need to make sure that the Jetson device is flashed with a JetPack system including SDK components such as CUDA, TensorRT, cuDNN and more. You can either use NVIDIA SDK Manager or the command line to flash JetPack to the device.
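If you are not sure whether those SDK components are present, a quick sanity check from the Jetson terminal is to query the JetPack meta-package and the key libraries (assuming JetPack was installed via SDK Manager or apt; exact package names can vary slightly between JetPack releases):

sudo apt-cache show nvidia-jetpack              # shows the installed JetPack version
dpkg -l | grep -E "cuda|nvinfer|cudnn"          # confirms CUDA, TensorRT and cuDNN packages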

For flashing guides for Seeed's Jetson-powered devices, please refer to the links below:

Install DeepStream

There are multiple ways of installing DeepStream on the Jetson device. You can follow this guide to learn more. However, we recommend installing DeepStream via SDK Manager because it ensures a successful and easy installation.

If you install DeepStream using SDK Manager, execute the commands below after the system boots up; they install additional dependencies required by DeepStream.

sudo apt install \
libssl1.1 \
libgstreamer1.0-0 \
gstreamer1.0-tools \
gstreamer1.0-plugins-good \
gstreamer1.0-plugins-bad \
gstreamer1.0-plugins-ugly \
gstreamer1.0-libav \
libgstreamer-plugins-base1.0-dev \
libgstrtspserver-1.0-0 \
libjansson4 \
libyaml-cpp-dev
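
Once the dependencies are installed, you can confirm that DeepStream itself is installed and check its version with:

deepstream-app --version-all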

Install Necessary Packages

  • Step 1. Access the terminal of the Jetson device, install pip and upgrade it
sudo apt update
sudo apt install -y python3-pip
pip3 install --upgrade pip
  • Step 2. Clone the following repo
git clone https://github.com/ultralytics/ultralytics.git
  • Step 3. Open requirements.txt
cd ultralytics
vi requirements.txt
  • Step 4. Comment out the following lines. Here you need to press i first to enter editing mode. Press ESC, then type :wq to save and quit
# torch>=1.7.0
# torchvision>=0.8.1

Note: torch and torchvision are excluded for now because they will be installed later.

  • Step 5. Install the necessary packages
pip3 install -r requirements.txt

If the installer complains about an outdated python-dateutil package, upgrade it with

pip3 install python-dateutil --upgrade

Install PyTorch and Torchvision

We cannot install PyTorch and Torchvision from pip because the default wheels are not compatible with the Jetson platform, which is based on the ARM aarch64 architecture. Therefore, we need to manually install a pre-built PyTorch pip wheel and compile/install Torchvision from source.

Visit this page to access all the PyTorch and Torchvision links.

Here are some of the versions supported by JetPack 5.0 and above.

PyTorch v1.11.0

Supported by JetPack 5.0 (L4T R34.1.0) / JetPack 5.0.1 (L4T R34.1.1) / JetPack 5.0.2 (L4T R35.1.0) with Python 3.8

file_name: torch-1.11.0-cp38-cp38-linux_aarch64.whl URL: https://nvidia.box.com/shared/static/ssf2v7pf5i245fk4i0q926hy4imzs2ph.whl

PyTorch v1.12.0

Supported by JetPack 5.0 (L4T R34.1.0) / JetPack 5.0.1 (L4T R34.1.1) / JetPack 5.0.2 (L4T R35.1.0) with Python 3.8

file_name: torch-1.12.0a0+2c916ef.nv22.3-cp38-cp38-linux_aarch64.whl URL: https://developer.download.nvidia.com/compute/redist/jp/v50/pytorch/torch-1.12.0a0+2c916ef.nv22.3-cp38-cp38-linux_aarch64.whl

  • Step 1. Install torch according to your JetPack version in the following format
wget <URL> -O <file_name>
pip3 install <file_name>

For example, here we are running JP5.0.2 and therefore we choose PyTorch v1.12.0

sudo apt-get install -y libopenblas-base libopenmpi-dev
wget https://developer.download.nvidia.com/compute/redist/jp/v50/pytorch/torch-1.12.0a0+2c916ef.nv22.3-cp38-cp38-linux_aarch64.whl -O torch-1.12.0a0+2c916ef.nv22.3-cp38-cp38-linux_aarch64.whl
pip3 install torch-1.12.0a0+2c916ef.nv22.3-cp38-cp38-linux_aarch64.whl
  • Step 2. Install torchvision depending on the version of PyTorch that you have installed. For example, since we chose PyTorch v1.12.0, we need to choose Torchvision v0.13.0
sudo apt install -y libjpeg-dev zlib1g-dev
git clone --branch v0.13.0 https://github.com/pytorch/vision torchvision
cd torchvision
python3 setup.py install --user

Here is a list of the corresponding torchvision version that you need to install according to the PyTorch version:

  • PyTorch v1.11 - torchvision v0.12.0
  • PyTorch v1.12 - torchvision v0.13.0

If you want a more detailed list, please check this link.
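
After both PyTorch and Torchvision are installed, you can run a quick sanity check to confirm the versions and that CUDA is visible to PyTorch (run it from a directory outside the torchvision source tree to avoid importing the local source by mistake):

python3 -c "import torch, torchvision; print(torch.__version__, torchvision.__version__, torch.cuda.is_available())"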

DeepStream Configuration for YOLOv8

  • Step 1. Clone the following repo
cd ~
git clone https://github.com/marcoslucianops/DeepStream-Yolo
  • Step 2. Checkout the repo to the following commit
cd DeepStream-Yolo
git checkout 68f762d5bdeae7ac3458529bfe6fed72714336ca
  • Step 3. Copy gen_wts_yoloV8.py from DeepStream-Yolo/utils into ultralytics directory
cp utils/gen_wts_yoloV8.py ~/ultralytics
  • Step 4. Inside the ultralytics repo, download the pt file from the YOLOv8 releases (example for YOLOv8s)
wget https://github.com/ultralytics/assets/releases/download/v0.0.0/yolov8s.pt

NOTE: You can use your custom model, but it is important to keep the YOLO model reference (yolov8_) in your cfg and weights/wts filenames to generate the engine correctly.

  • Step 5. Generate the cfg, wts and labels.txt (if available) files (example for YOLOv8s)
python3 gen_wts_yoloV8.py -w yolov8s.pt

Note: To change the inference size (default: 640), use one of the following options; a complete example command is shown after the examples below

-s SIZE
--size SIZE
-s HEIGHT WIDTH
--size HEIGHT WIDTH

Example for 1280:

-s 1280
or
-s 1280 1280
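
For instance, the full command to generate the files for a 1280x1280 input would be:

python3 gen_wts_yoloV8.py -w yolov8s.pt -s 1280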
  • Step 6. Copy the generated cfg, wts and labels.txt (if generated) files into the DeepStream-Yolo folder
cp yolov8s.cfg ~/DeepStream-Yolo
cp yolov8s.wts ~/DeepStream-Yolo
cp labels.txt ~/DeepStream-Yolo
  • Step 7. Open the DeepStream-Yolo folder and compile the library
cd ~/DeepStream-Yolo
CUDA_VER=11.4 make -C nvdsinfer_custom_impl_Yolo # for DeepStream 6.2/ 6.1.1 / 6.1
CUDA_VER=10.2 make -C nvdsinfer_custom_impl_Yolo # for DeepStream 6.0.1 / 6.0
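
If you are unsure which CUDA version your JetPack image ships with, you can check it before choosing the CUDA_VER value:

nvcc --version
# or, if nvcc is not on your PATH:
/usr/local/cuda/bin/nvcc --version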
  • Step 8. Edit the config_infer_primary_yoloV8.txt file according to your model (example for YOLOv8s with 80 classes)
[property]
...
custom-network-config=yolov8s.cfg
model-file=yolov8s.wts
...
num-detected-classes=80
...
  • Step 9. Edit the deepstream_app_config.txt file
...
[primary-gie]
...
config-file=config_infer_primary_yoloV8.txt
  • Step 10. Change the video source in the deepstream_app_config.txt file. Here a default video file is loaded, as you can see below
...
[source0]
...
uri=file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_1080p_h264.mp4
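
If you want to run on a live camera instead of the sample file, point [source0] at your own stream. The snippet below is only an illustration with a placeholder address; in the DeepStream reference-application configuration, type=4 selects an RTSP source (see the DeepStream documentation for the full list of source types):

...
[source0]
enable=1
type=4
uri=rtsp://<user>:<password>@<camera-ip>:554/<stream-path>
...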

Run the Inference

deepstream-app -c deepstream_app_config.txt

The above result was obtained on a Jetson AGX Orin 32GB H01 Kit running YOLOv8s 640x640 at FP32. The FPS appears to be around 60, but that is not the true FPS: when type=2 is set under [sink0] in the deepstream_app_config.txt file, the FPS is limited to the refresh rate of the monitor, and the monitor used for this test runs at 60Hz. However, if you change this value to type=1, you will be able to obtain the maximum FPS, but there will be no live detection output.
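
For reference, the relevant setting is in the [sink0] section of deepstream_app_config.txt; for maximum FPS without a display it would look like this:

[sink0]
enable=1
# type=2 renders to the screen, so FPS is capped at the monitor refresh rate
# type=1 is a fakesink: maximum FPS, but no live detection output
type=1
...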

For the same video source and the same model as used above, after changing type=1 under [sink0], the below result can be obtained.

As you can see, we now get about 139 FPS, which reflects the real inference performance.

INT8 Calibration

If you want to use INT8 precision for inference, you need to follow the steps below

  • Step 1. Install OpenCV
sudo apt-get install libopencv-dev
  • Step 2. Compile/recompile the nvdsinfer_custom_impl_Yolo library with OpenCV support
cd ~/DeepStream-Yolo
CUDA_VER=11.4 OPENCV=1 make -C nvdsinfer_custom_impl_Yolo # for DeepStream 6.2/ 6.1.1 / 6.1
CUDA_VER=10.2 OPENCV=1 make -C nvdsinfer_custom_impl_Yolo # for DeepStream 6.0.1 / 6.0
  • Step 3. For the COCO dataset, download val2017, extract it, and move it to the DeepStream-Yolo folder

  • Step 4. Make a new directory for calibration images

mkdir calibration
  • Step 5. Run the following to select 1000 random images from the COCO dataset for calibration
for jpg in $(ls -1 val2017/*.jpg | sort -R | head -1000); do \
cp ${jpg} calibration/; \
done

Note: NVIDIA recommends at least 500 images to get good accuracy. In this example, 1000 images are chosen to get better accuracy (more images = more accuracy). Higher INT8_CALIB_BATCH_SIZE values will result in better accuracy and faster calibration. Set it according to your GPU memory. The number of images is controlled by head -1000; for example, for 2000 images, use head -2000. This process can take a long time.

  • Step 6. Create the calibration.txt file with all selected images
realpath calibration/*jpg > calibration.txt
  • Step 7. Set environment variables
export INT8_CALIB_IMG_PATH=calibration.txt
export INT8_CALIB_BATCH_SIZE=1
  • Step 8. Update the config_infer_primary_yoloV8.txt file

From

...
model-engine-file=model_b1_gpu0_fp32.engine
#int8-calib-file=calib.table
...
network-mode=0
...

To

...
model-engine-file=model_b1_gpu0_int8.engine
int8-calib-file=calib.table
...
network-mode=1
...
  • Step 9. Before running the inference, set type=1 under [sink0] in the deepstream_app_config.txt file as mentioned before so that you can obtain the maximum FPS performance.

  • Step 10. Run the inference

deepstream-app -c deepstream_app_config.txt

Here we get an FPS value of about 350!

Multistream Configuration

NVIDIA DeepStream allows you to easily set up multiple streams in a single configuration file to build multistream video analytics applications. Later in this wiki we will demonstrate, along with some benchmarks, how models with high FPS performance can really help with multistream applications.

Here we will take 9 streams as an example. We will be changing the deepstream_app_config.txt file.

  • Step 1. Inside the [tiled-display] section, change the rows and columns to 3 and 3 so that we can have a 3x3 grid with 9 streams
[tiled-display]
rows=3
columns=3
  • Step 2. Inside the [source0] section, set num-sources=9 and add more uri entries. Here we simply duplicate the current example video file 8 times to make up 9 streams in total. However, you can point to different video streams according to your application
[source0]
enable=1
type=3
uri=file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_1080p_h264.mp4
uri=file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_1080p_h264.mp4
uri=file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_1080p_h264.mp4
uri=file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_1080p_h264.mp4
uri=file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_1080p_h264.mp4
uri=file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_1080p_h264.mp4
uri=file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_1080p_h264.mp4
uri=file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_1080p_h264.mp4
uri=file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_1080p_h264.mp4
num-sources=9
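
Depending on your pipeline, you may also want to keep the muxer batch size aligned with the number of sources; in the deepstream-app configuration this is the batch-size property of the [streammux] group (an optional tuning step, not strictly required for this demo):

[streammux]
...
batch-size=9
...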

Now if you run the application again with the deepstream-app -c deepstream_app_config.txt command, you will see the following output

trtexec Tool

Included in the samples directory is a command-line wrapper tool called trtexec. trtexec is a tool to use TensorRT without having to develop your own application. The trtexec tool has three main purposes:

  • Benchmarking networks on random or user-provided input data.
  • Generating serialized engines from models.
  • Generating a serialized timing cache from the builder.

Here we can use the trtexec tool to quickly benchmark the models with different parameters. But first of all, you need an ONNX model, which we can generate using ultralytics YOLOv8.

  • Step 1. Build ONNX using:
yolo mode=export model=yolov8s.pt format=onnx
  • Step 2. Build the engine file using trtexec as follows:
cd /usr/src/tensorrt/bin
./trtexec --onnx=<path_to_onnx_file> --saveEngine=<path_to_save_engine_file>

For example:

./trtexec --onnx=/home/nvidia/yolov8s.onnx --saveEngine=/home/nvidia/yolov8s.engine

This will output performance results along with a generated .engine file. By default it converts the ONNX model into a TensorRT-optimized file in FP32 precision, and you can see the output as follows

Here we can take the mean latency as 7.2 ms, which translates to 139 FPS. This is the same performance we got in the previous DeepStream demo.

However, if you want INT8 precision, which offers better performance, you can execute the above command as follows

./trtexec --onnx=/home/nvidia/yolov8s.onnx --int8 --saveEngine=/home/nvidia/yolov8s.engine 

Here we can take the mean latency as 3.2 ms, which translates to 313 FPS.
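
trtexec can also build an FP16 engine, which typically falls between FP32 and INT8 in both speed and accuracy; the exact numbers will depend on your device:

./trtexec --onnx=/home/nvidia/yolov8s.onnx --fp16 --saveEngine=/home/nvidia/yolov8s.engine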

YOLOv8 Benchmark Results

We have done performance benchmarks for different YOLOv8 models running on the reComputer J4012, AGX Orin 32GB H01 Kit and reComputer J2021.

To learn about more performance benchmarks we have done using YOLOv8 models, please check our blog.

Multistream Model Benchmarks

After running several deepstream applications on reComputer Jetson Orin series products, we have done benchmarks with the YOLOv8s models.

  • First, we have used a single AI model and run multiple streams on the same AI model
  • Second, we have used multiple AI models and run multiple streams on multiple AI models

All these benchmarks are carried out under the following conditions:

  • YOLOv8s 640x640 image input
  • Disable UI
  • Turn on max power and max performance mode (see the commands after this list)
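
For reference, max power and max clocks are usually enabled on Jetson with the following commands (the index of the maximum power mode differs between Jetson modules; run sudo nvpmodel -q to list the available modes):

sudo nvpmodel -m 0    # select the maximum power mode (index may differ per device)
sudo jetson_clocks    # lock the clocks at their maximum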

From these benchmarks, we can see that on the highest-end Orin NX 16GB device with a single YOLOv8s model at INT8, you can use around 40 cameras at around 5 fps, and with a separate YOLOv8s model at INT8 for each stream, you can use around 11 cameras at around 15 fps. For multi-model applications, the number of cameras is lower because of the RAM limitations of the device, since each model takes up a substantial amount of RAM.

In summary, when operating an edge device with only the YOLOv8 model and no other applications running, the Jetson Orin Nano 8GB can support 4-6 streams, whereas the Jetson Orin NX 16GB can manage 16-18 streams at maximum capacity. However, these numbers may decrease as RAM resources are consumed by real-world applications. Therefore, it's advisable to use these figures as guidelines and conduct your own tests under your specific conditions.

Resources

Tech Support & Product Discussion

Thank you for choosing our products! We are here to provide you with different support to ensure that your experience with our products is as smooth as possible. We offer several communication channels to cater to different preferences and needs.
