Using TensorRT
This document outlines the general process of AI inference acceleration with TensorRT on an OVP8xx device.
Building a TensorRT container
There are two options:
- Use a base NVIDIA container and import the runtime libraries directly from the firmware. This is the preferred method and is described below.
- Use a complete NVIDIA container that includes the TensorRT libraries directly. This is not recommended, since container sizes increase dramatically.
NVIDIA base containers
NVIDIA provides L4T-based containers with TensorFlow that can be downloaded directly from their container catalog. TensorFlow should be used with the corresponding recommended version of JetPack. The recommendations can be found on the TensorFlow for Jetson website.
Compatibility Matrix
VPU Hardware | VPU Firmware | L4T Version | JetPack Version | TensorFlow | PyTorch | Machine learning |
---|---|---|---|---|---|---|
OVP81x | 1.10.13 | R32.7.5 | 4.6.5 | nvcr.io/nvidia/l4t-tensorflow:r32.7.1-tf2.7-py3 <br> nvcr.io/nvidia/l4t-tensorflow:r32.7.1-tf1.15-py3 | nvcr.io/nvidia/l4t-pytorch:r32.7.1-pth1.10-py3 <br> nvcr.io/nvidia/l4t-pytorch:r32.7.1-pth1.9-py3 | nvcr.io/nvidia/l4t-ml:r32.7.1-py3 |
OVP81x | 1.4.30 | R32.7.3 | 4.6.3 | nvcr.io/nvidia/l4t-tensorflow:r32.7.1-tf2.7-py3 <br> nvcr.io/nvidia/l4t-tensorflow:r32.7.1-tf1.15-py3 | nvcr.io/nvidia/l4t-pytorch:r32.7.1-pth1.10-py3 <br> nvcr.io/nvidia/l4t-pytorch:r32.7.1-pth1.9-py3 | nvcr.io/nvidia/l4t-ml:r32.7.1-py3 |
OVP80x | 1.4.32 | R32.4.3 | 4.4.0 | nvcr.io/nvidia/l4t-tensorflow:r32.4.3-tf2.2-py3 <br> nvcr.io/nvidia/l4t-tensorflow:r32.4.3-tf1.15-py3 | nvcr.io/nvidia/l4t-pytorch:r32.4.3-pth1.6-py3 | nvcr.io/nvidia/l4t-ml:r32.4.3-py3 |
Loading the TensorRT libraries into the container is handled by the NVIDIA runtime and Docker, as long as the container version and the JetPack version closely match.
Note
To access or modify the Dockerfiles and scripts used to build the NVIDIA containers, see this GitHub repository.
Verify the functionality
Start an interactive session in the container and try to import torch in interactive Python as shown below. This assumes that the container has previously been deployed onto the VPU.
ovp81x-fc-6c-6d:~$ docker run -ti --runtime nvidia nvcr.io/nvidia/l4t-ml:r32.7.1-py3
allow 10 sec for JupyterLab to start @ http://172.17.0.2:8888 (password nvidia)
JupterLab logging location: /var/log/jupyter.log (inside the container)
root@89e52a1dfd4c:/# python3
Python 3.6.9 (default, Dec 8 2021, 21:08:43)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
1
>>> torch.cuda.current_device()
0
>>> torch.cuda.get_device_name(0)
'NVIDIA Tegra X2'
>>>
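In addition to PyTorch, you can check that the TensorRT libraries mounted into the container are usable from Python. The following is a minimal sketch; it assumes the `tensorrt` Python bindings are available inside the container (they are typically provided by the host JetPack installation through the NVIDIA runtime):

```python
# Minimal sketch: verify that the TensorRT libraries mounted into the container
# are usable from Python. Assumes the `tensorrt` Python bindings are available
# (typically provided by the host JetPack installation via the NVIDIA runtime).
import tensorrt as trt

print("TensorRT version:", trt.__version__)

# Creating a builder exercises the native libraries, not just the bindings.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
print("Builder created:", builder is not None)
```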
To mount scripts, data, etc. from the VPU's filesystem to run inside the container, use Docker's `-v` flag when starting your Docker instance:
$ docker run -it --rm --runtime nvidia --network host -v /home/user/project:/location/in/container nvcr.io/nvidia/l4t-ml:r32.7.1-py3
Using TensorRT in a container on the VPU
TensorRT applications can be memory-intensive. Here is how you can manage memory effectively:
- Use the `l4t-cuda-base` image and build TensorRT inside the container using a Dockerfile. We recommend using Docker's multi-stage build feature to reduce the size of the final container.
- Reduce the size of the build context by using a `.dockerignore` file.
- Follow the Dockerfile best practices to minimize the number of layers and the overall size.
Once the Docker container is deployed on the VPU, you can proceed as follows:
- Run TensorRT models using `trtexec` inside the `l4t-base` container. This container copies TensorRT from the host. `trtexec` runs the model, feeding it with random data for testing purposes. This gives a first indication of whether the adapted model can run on the final architecture.
- Adapt the model for the final deployment architecture. This may involve updating the model based on its structure and the operators and layers used. Not all operators and model adaptations may be available in the OVP8xx JetPack version. You may need to update your model on your development machine, export a new ONNX model with opset 11 operators, and adapt it again. This can be an iterative process; a quick way to inspect which operators an exported model uses is sketched after this list.
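The following is a minimal sketch of such an inspection, run on the development machine; it assumes the `onnx` Python package is installed, and the file name is a placeholder:

```python
# Minimal sketch: structural check of an exported ONNX model and a listing of
# the operator types it uses. Assumes the `onnx` Python package is installed;
# "your_model.onnx" is a placeholder file name.
import onnx

model = onnx.load("your_model.onnx")
onnx.checker.check_model(model)   # raises an exception if the graph is malformed

print("opset version:", model.opset_import[0].version)
print("operators used:", sorted({node.op_type for node in model.graph.node}))
```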
Adaptations for the OVP8xx architecture
The model has to be exported and adapted to the final deployment architecture. Refer to the NVIDIA documentation for this process. This adaptation must be done on the final deployment architecture. Compiling on similar architectures, like Jetson evaluation boards, will result in an incompatible instruction set for the OVP8xx architecture.
We recommend exporting the neural network model to an ONNX model. Adapting the model for the deployment architecture may require updates. This could be an iterative process to get the model running on the final architecture. Update your model on your development machine, export a new ONNX model with opset 11 operators, and test this update in Docker.
For ONNX exports with opset 11 settings and further ONNX operator support, refer to the official `onnx-tensorrt` documentation.
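As a reference, this is a minimal sketch of such an export on the development machine; the torchvision model, input size, and file name are placeholders for your own network:

```python
# Minimal sketch: exporting a PyTorch model to ONNX with opset 11 operators on
# the development machine. The torchvision model, input size, and file name are
# placeholders for your own network.
import torch
import torchvision

model = torchvision.models.resnet18().eval()
dummy_input = torch.randn(1, 3, 480, 640)

torch.onnx.export(
    model,
    dummy_input,
    "model_opset11.onnx",
    opset_version=11,
    input_names=["images"],
    output_names=["output"],
)
```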
Runtime inference cycle times
Adapting the model as described results in a specific inference runtime on the VPU. You may need to adjust for different model sizes and operations. Keep in mind that the typical cycle time on a development machine does not accurately reflect the expected cycle times on the OVP8xx (TX2/TX2-NX) hardware.
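To get a rough feeling for this difference, the same model can be timed on both the development machine and the device. This is a minimal sketch, assuming PyTorch is available (for example inside the l4t-pytorch or l4t-ml container); `trtexec`, used in the following sections, remains the reference tool for TensorRT timings:

```python
# Minimal sketch: rough GPU inference timing of a PyTorch model, runnable both
# on a development machine and inside an l4t-pytorch / l4t-ml container on the
# VPU. The torchvision model and input size are placeholders.
import time
import torch
import torchvision

model = torchvision.models.resnet18().eval().cuda()
x = torch.randn(1, 3, 480, 640, device="cuda")

with torch.no_grad():
    for _ in range(10):              # warm-up iterations
        model(x)
    torch.cuda.synchronize()

    runs = 50
    start = time.time()
    for _ in range(runs):
        model(x)
    torch.cuda.synchronize()

print(f"average inference time: {(time.time() - start) / runs * 1000:.2f} ms")
```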
Calculating the inference time on the OVP81x VPU using a YOLOv11 ONNX model file
Pull the machine learning base image provided by NVIDIA.
Create a YOLOv11 ONNX model file using a Python script:

# Create the YOLOv11 ONNX model (assumes the ultralytics package is installed)
from ultralytics import YOLO

# Load a pretrained YOLO model
model = YOLO("yolo11n.pt")

# Export the model to ONNX format with the desired input size
model.export(format="onnx", imgsz=[480, 640])

Copy the Docker container and the ONNX model file to the VPU.
Run the Docker image in interactive mode:
$ docker run -it --runtime nvidia --gpus all -v /path/to/your/model:/workspace/model nvcr.io/nvidia/l4t-ml:r32.7.1-py3
Example runs
Run the following command inside the Docker container to measure inference timings:
$ /usr/src/tensorrt/bin/trtexec --onnx=yolov11/yolov11n.onnx --verbose --fp16
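As an alternative to `trtexec`, the FP16 engine can also be built programmatically. This is a minimal sketch using the TensorRT Python API, assuming TensorRT 8.x bindings (JetPack 4.6.x) are available in the container; the paths are placeholders:

```python
# Minimal sketch: building a serialized FP16 TensorRT engine from the ONNX file
# with the TensorRT Python API. Assumes TensorRT 8.x bindings (JetPack 4.6.x);
# on TensorRT 7 use builder.build_engine() instead. Paths are placeholders.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("/workspace/model/yolov11n.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("ONNX parsing failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)   # FP16 gives the best performance on TX2

engine_bytes = builder.build_serialized_network(network, config)
with open("/workspace/model/yolov11n_fp16.engine", "wb") as f:
    f.write(engine_bytes)
```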
Inference timings
Model | Batch Size | Precision | Inference time |
---|---|---|---|
YOLOv11-N | 1 | FP16 | 20.62 ms |
YOLOv11-M | 1 | FP16 | 93.24 ms |
Deepstream-l4t
The Deepstream-l4t NGC container is used in this example.
Pull the Deepstream-l4t NGC container.
$ docker pull nvcr.io/nvidia/deepstream-l4t:5.1-21.02-samples
Verify the successful pull by listing the Docker images.
$ docker image ls
REPOSITORY                      TAG                 IMAGE ID       CREATED        SIZE
nvcr.io/nvidia/deepstream-l4t   5.1-21.02-samples   0ff77669c10    6 months ago   2.72GB
Start the container on the VPU. Replace the mounted volume directory with your directory of choice containing the ONNX model:
$ docker container run -it --rm --net=host --runtime nvidia -v /home/jetsontx2/for_container/:/home/dl_models nvcr.io/nvidia/deepstream-l4t:5.1-21.02-samples bash
In the container, navigate to the `/home/dl_models` directory and run `trtexec` with the following command:
$ /usr/src/tensorrt/bin/trtexec --onnx=/home/dl_models/yolov4tiny_relu_best_ops12_fp32.onnx --fp16 --explicitBatch=1
Optimal performance is achieved by using FP16 (16-bit floating point) precision. The TX2 board has compute capability 6.2 (SM62 architecture), which does not support INT8. The output of trtexec for the YOLOv4 Tiny network with FP16 precision is shown below:
root@jetsontx2-desktop:/home/dl_models# /usr/src/tensorrt/bin/trtexec --onnx=/home/dl_models/yolov4tiny_relu_best_ops12_fp32.onnx --fp16 --explicitBatch=1
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=/home/dl_models/yolov4tiny_relu_best_ops12_fp32.onnx --fp16 --explicitBatch=1
[09/23/2021-10:20:45] [I] === Model Options ===
[09/23/2021-10:20:45] [I] Format: ONNX
[09/23/2021-10:20:45] [I] Model: /home/dl_models/yolov4tiny_relu_best_ops12_fp32.onnx
[09/23/2021-10:20:45] [I] Output:
[09/23/2021-10:20:45] [I] === Build Options ===
[09/23/2021-10:20:45] [I] Max batch: explicit
[09/23/2021-10:20:45] [I] Workspace: 16 MB
[09/23/2021-10:20:45] [I] minTiming: 1
[09/23/2021-10:20:45] [I] avgTiming: 8
[09/23/2021-10:20:45] [I] Precision: FP32+FP16
[09/23/2021-10:20:45] [I] Calibration:
[09/23/2021-10:20:45] [I] Safe mode: Disabled
[09/23/2021-10:20:45] [I] Save engine:
[09/23/2021-10:20:45] [I] Load engine:
[09/23/2021-10:20:45] [I] Builder Cache: Enabled
[09/23/2021-10:20:45] [I] NVTX verbosity: 0
[09/23/2021-10:20:45] [I] Inputs format: fp32:CHW
[09/23/2021-10:20:45] [I] Outputs format: fp32:CHW
[09/23/2021-10:20:45] [I] Input build shapes: model
[09/23/2021-10:20:45] [I] Input calibration shapes: model
[09/23/2021-10:20:45] [I] === System Options ===
[09/23/2021-10:20:45] [I] Device: 0
[09/23/2021-10:20:45] [I] DLACore:
[09/23/2021-10:20:45] [I] Plugins:
[09/23/2021-10:20:45] [I] === Inference Options ===
[09/23/2021-10:20:45] [I] Batch: Explicit
[09/23/2021-10:20:45] [I] Input inference shapes: model
[09/23/2021-10:20:45] [I] Iterations: 10
[09/23/2021-10:20:45] [I] Duration: 3s (+ 200ms warm up)
[09/23/2021-10:20:45] [I] Sleep time: 0ms
[09/23/2021-10:20:45] [I] Streams: 1
[09/23/2021-10:20:45] [I] ExposeDMA: Disabled
[09/23/2021-10:20:45] [I] Spin-wait: Disabled
[09/23/2021-10:20:45] [I] Multithreading: Disabled
[09/23/2021-10:20:45] [I] CUDA Graph: Disabled
[09/23/2021-10:20:45] [I] Skip inference: Disabled
[09/23/2021-10:20:45] [I] Inputs:
[09/23/2021-10:20:45] [I] === Reporting Options ===
[09/23/2021-10:20:45] [I] Verbose: Disabled
[09/23/2021-10:20:45] [I] Averages: 10 inferences
[09/23/2021-10:20:45] [I] Percentile: 99
[09/23/2021-10:20:45] [I] Dump output: Disabled
[09/23/2021-10:20:45] [I] Profile: Disabled
[09/23/2021-10:20:45] [I] Export timing to JSON file:
[09/23/2021-10:20:45] [I] Export output to JSON file:
[09/23/2021-10:20:45] [I] Export profile to JSON file:
[09/23/2021-10:20:45] [I] ----------------------------------------------------------------
Input filename:   /home/dl_models/yolov4tiny_relu_best_ops12_fp32.onnx
ONNX IR version:  0.0.6
Opset version:    12
Producer name:    pytorch
Producer version: 1.8
Domain:
Model version:    0
Doc string:
----------------------------------------------------------------
[09/23/2021-10:20:47] [W] [TRT] onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[09/23/2021-10:20:47] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[09/23/2021-10:20:47] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[09/23/2021-10:20:47] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[09/23/2021-10:20:47] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[09/23/2021-10:20:47] [W] [TRT] Output type must be INT32 for shape outputs
[09/23/2021-10:20:56] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[09/23/2021-10:24:32] [I] [TRT] Detected 1 inputs and 6 output network tensors.
[09/23/2021-10:24:33] [I] Starting inference threads
[09/23/2021-10:24:36] [I] Warmup completed 0 queries over 200 ms
[09/23/2021-10:24:36] [I] Timing trace has 0 queries over 3.01861 s
[09/23/2021-10:24:36] [I] Trace averages of 10 runs:
[09/23/2021-10:24:36] [I] Average on 10 runs - GPU latency: 11.6003 ms - Host latency: 11.7851 ms (end to end 11.8375 ms, enqueue 6.83557 ms)
[09/23/2021-10:24:36] [I] Average on 10 runs - GPU latency: 11.0905 ms - Host latency: 11.2746 ms (end to end 11.2852 ms, enqueue 6.02471 ms)
[09/23/2021-10:24:36] [I] Average on 10 runs - GPU latency: 11.0689 ms - Host latency: 11.2532 ms (end to end 11.2637 ms, enqueue 5.55458 ms)
[09/23/2021-10:24:36] [I] Average on 10 runs - GPU latency: 11.1319 ms - Host latency: 11.3166 ms (end to end 11.3275 ms, enqueue 6.30752 ms)
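As a cross-check of the compute capability mentioned above, it can also be queried from Python. This is a minimal sketch, assuming PyTorch is available in the container:

```python
# Minimal sketch: query the CUDA compute capability from Python, assuming
# PyTorch is available in the container. TX2/TX2-NX is expected to report 6.2.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")
```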