Inference is slow on Jetson Xavier NX
I am running the following test.py script to benchmark inference.
For resnet50, the FPS is supposed to be ~312, but I get ~68.
import torch
import torchvision.models as models
import numpy as np
from time import time
from torch2trt import torch2trt


def inference_test():
    device = torch.device('cuda:0')
    # Create model and input.
    model = models.resnet50(pretrained=True)
    tmp = (np.random.standard_normal([1, 3, 224, 224]) * 255).astype(np.uint8)
    # tmp = (np.random.standard_normal([1, 3, 416, 416]) * 255).astype(np.uint8) #mobilenet_v2
    # move them to the device
    model.eval()
    model.to(device)
    img = torch.from_numpy(tmp.astype(np.float32)).to(device)
    # convert to TensorRT feeding sample data as input
    model_trt = torch2trt(model, [img])

    def infer():
        with torch.no_grad():
            before = time()
            # outs = model(img)
            outs = model_trt(img)
            infer_time = time() - before
        return infer_time

    print("Running warming up iterations..")
    for i in range(0, 100):
        infer()
    total_infer_time = 0
    print("Running the test iterations..")
    for i in range(0, 100):
        total_infer_time += infer()
    print(f"FPS: {100 / total_infer_time}")


inference_test()
OUTPUT:
Running warming up iterations..
Running the test iterations..
FPS: 67.9085001712161
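One thing worth ruling out in a benchmark like this is asynchronous GPU execution: CUDA work is queued and the Python call can return before the kernels finish, so wall-clock timing around a single call may not reflect the real latency. Below is a minimal sketch of a timing helper that waits for the GPU before starting and stopping the clock. The name `measure_fps` and its parameters are hypothetical and not part of the original script; `sync` is assumed to be a blocking call such as `torch.cuda.synchronize`.

```python
from time import perf_counter


def measure_fps(run_once, iterations=100, warmup=100, sync=None):
    """Return iterations per second for `run_once`.

    `sync` is an optional callable that blocks until queued GPU work is
    done (assumption: something like torch.cuda.synchronize).
    """
    # Warm-up runs are executed but not timed.
    for _ in range(warmup):
        run_once()
    if sync is not None:
        sync()  # drain warm-up work before starting the clock

    start = perf_counter()
    for _ in range(iterations):
        run_once()
    if sync is not None:
        sync()  # wait for the last queued work before stopping the clock
    elapsed = perf_counter() - start
    return iterations / elapsed


# Possible use with the script above (not executed here):
#   fps = measure_fps(lambda: model_trt(img), sync=torch.cuda.synchronize)
```

Timing the whole loop once, rather than summing per-call timings, also keeps the timer overhead out of the result. Whether this changes the ~68 FPS figure on this board is not something the thread establishes.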
Jetson env:
- NVIDIA Jetson Xavier NX (Developer Kit Version)
* Jetpack 4.4 [L4T 32.4.3]
* NV Power Mode: MODE_15W_6CORE - Type: 2
* jetson_stats.service: active
- Libraries:
* CUDA: 10.2.89
* cuDNN: 8.0.0.180
* TensorRT: 7.1.3.0
* Visionworks: 1.6.0.501
* OpenCV: 4.1.1 compiled CUDA: NO
* VPI: 0.3.7
* Vulkan: 1.2.70
$ sudo jetson_clocks --show
SOC family:tegra194 Machine:NVIDIA Jetson Xavier NX Developer Kit
Online CPUs: 0-5
CPU Cluster Switching: Disabled
cpu0: Online=1 Governor=schedutil MinFreq=1420800 MaxFreq=1420800 CurrentFreq=1420800 IdleStates: C1=0 c6=0
cpu1: Online=1 Governor=schedutil MinFreq=1420800 MaxFreq=1420800 CurrentFreq=1420800 IdleStates: C1=0 c6=0
cpu2: Online=1 Governor=schedutil MinFreq=1420800 MaxFreq=1420800 CurrentFreq=1420800 IdleStates: C1=0 c6=0
cpu3: Online=1 Governor=schedutil MinFreq=1420800 MaxFreq=1420800 CurrentFreq=1420800 IdleStates: C1=0 c6=0
cpu4: Online=1 Governor=schedutil MinFreq=1420800 MaxFreq=1420800 CurrentFreq=1420800 IdleStates: C1=0 c6=0
cpu5: Online=1 Governor=schedutil MinFreq=1420800 MaxFreq=1420800 CurrentFreq=1420800 IdleStates: C1=0 c6=0
GPU MinFreq=1109250000 MaxFreq=1109250000 CurrentFreq=1109250000
EMC MinFreq=204000000 MaxFreq=1600000000 CurrentFreq=1600000000 FreqOverride=1
Fan: speed=130
NV Power Mode: MODE_15W_6CORE
torch version: 1.6
torchvision version: 0.7.0
Could you please help me find the cause? Thanks.
Issue Analytics
- State:
- Created 3 years ago
- Comments:7
Top GitHub Comments
Thanks a lot. So I need TensorRT version >= 7.0 and a device that supports INT8 inference, such as the Xavier NX.
Hi there! I tried it on my Xavier NX. I got the following results:
TRT + fp32 = ~61 fps
TRT + fp16 = ~220 fps
TRT + int8 = ~340 fps
I think Nvidia's claimed speed is for a quantized model.
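The three precision modes compared in this comment can presumably be reproduced by passing torch2trt's `fp16_mode` and `int8_mode` keyword arguments at conversion time. The sketch below builds one engine per mode. The helper name `build_trt_variants` is hypothetical, and the converter is passed in as an argument (assumed to be `torch2trt.torch2trt`) so the helper itself has no GPU dependency. INT8 accuracy depends on the calibration data, which this sketch does not address.

```python
def build_trt_variants(model, sample, convert):
    """Convert `model` once per precision mode and return engines by name.

    `convert` is assumed to behave like torch2trt.torch2trt, i.e. it is
    called as convert(model, [sample], **kwargs).
    """
    # Keyword arguments per precision mode; fp32 is the default conversion.
    modes = {
        "fp32": {},
        "fp16": {"fp16_mode": True},
        "int8": {"int8_mode": True},
    }
    return {
        name: convert(model, [sample], **kwargs)
        for name, kwargs in modes.items()
    }


# Possible use on the Jetson (not executed here):
#   from torch2trt import torch2trt
#   engines = build_trt_variants(model, img, torch2trt)
#   outs = engines["fp16"](img)
```

Each engine can then be benchmarked with the same loop as the original script, which makes it easier to see which mode a published FPS figure corresponds to.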