
[Libtorch] Triton server produces inconsistent results when hosting multiple models in one GPU

See original GitHub issue

Description

With the Triton server hosting multiple models on one GPU, we see inconsistent results when the GPU is heavily used (the same inputs give different and incorrect outputs).

We created a minimal example in the following repository to reproduce this issue: https://github.com/zmy1116/tritoninferenceissues In this example, we create two TensorRT models from TorchVision's resnet18 and inceptionv3 using NVIDIA torch2trt. We launch 4 inference jobs for each of the two models with the same inputs, and there is a significant discrepancy in the results from the same model on the same input.

So far in our tests, this error only seems to happen when multiple models are running heavily on the same GPU. The issue does not appear to be model dependent; we have seen it on our production models with various backbones:

  • resnet
  • inceptionnet
  • efficientnet
  • lstm

Triton Information

What version of Triton are you using? We use Triton 2.5.0.
Are you using the Triton container or did you build it yourself? We directly use the NGC container nvcr.io/nvidia/tritonserver:20.11-py3.

Additional system information

The problem occurs on an AWS VM instance of type g4dn.2xlarge, which contains one T4 GPU. The models in the model repository are all TensorRT models, created in the corresponding NGC container nvcr.io/nvidia/tensorrt:20.11-py3.

To Reproduce

We created a minimal example in the following repository to reproduce this issue: https://github.com/zmy1116/tritoninferenceissues

To summarize the steps:

  • launch the Triton server container:
# assume the model repository is at /home/ubuntu/trt_issues_data/model_repository_2011
docker run -d --gpus=all --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p8000:8000 -p8001:8001 -p8002:8002 -v /home/ubuntu/trt_issues_data/model_repository_2011:/models -e CUDA_VISIBLE_DEVICES=0 nvcr.io/nvidia/tritonserver:20.11-py3 tritonserver --model-repository=/models --strict-model-config=false
  • use the Python gRPC client to invoke the 2 models multiple times and store the results of each run (a rough sketch of the client call appears after the commands below). For one T4 machine, I run:
    • 4 jobs per model
    • each job runs 64 rounds of inference with inputs of size 64x3x224x224
nohup python /home/ubuntu/tritoninferenceissues/triton_inference.py resnet18 /home/ubuntu/trt_issues_data/testing_inputs.p 64 outputs1.p &
nohup python /home/ubuntu/tritoninferenceissues/triton_inference.py inceptionv3 /home/ubuntu/trt_issues_data/testing_inputs.p 64 outputs5.p &
nohup python /home/ubuntu/tritoninferenceissues/triton_inference.py resnet18 /home/ubuntu/trt_issues_data/testing_inputs.p 64 outputs2.p &
nohup python /home/ubuntu/tritoninferenceissues/triton_inference.py inceptionv3 /home/ubuntu/trt_issues_data/testing_inputs.p 64 outputs6.p &
nohup python /home/ubuntu/tritoninferenceissues/triton_inference.py resnet18 /home/ubuntu/trt_issues_data/testing_inputs.p 64 outputs3.p &
nohup python /home/ubuntu/tritoninferenceissues/triton_inference.py inceptionv3 /home/ubuntu/trt_issues_data/testing_inputs.p 64 outputs7.p &
nohup python /home/ubuntu/tritoninferenceissues/triton_inference.py resnet18 /home/ubuntu/trt_issues_data/testing_inputs.p 64 outputs4.p &
nohup python /home/ubuntu/tritoninferenceissues/triton_inference.py inceptionv3 /home/ubuntu/trt_issues_data/testing_inputs.p 64 outputs8.p &
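
Each of these jobs is an instance of triton_inference.py from the linked repository, which is the authoritative client. As a rough sketch of the kind of gRPC request each job issues (using the standard tritonclient gRPC API; the tensor names input_0/output_0 are torch2trt's default binding names and are an assumption here):

# Sketch only; the real client is triton_inference.py in the linked repo.
# Assumes tritonclient[grpc] is installed and torch2trt's default tensor
# names (input_0 / output_0); adjust to match the auto-generated config.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")
assert client.is_server_ready() and client.is_model_ready("resnet18")

batch = np.random.randn(64, 3, 224, 224).astype(np.float16)
infer_input = grpcclient.InferInput("input_0", list(batch.shape), "FP16")
infer_input.set_data_from_numpy(batch)

result = client.infer(
    model_name="resnet18",
    inputs=[infer_input],
    outputs=[grpcclient.InferRequestedOutput("output_0")],
)
print(result.as_numpy("output_0").shape)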

I think the important thing here is to load the GPU intensively enough (so that GPU utilization reaches at least 80%).
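
If it helps to confirm that level of load while the jobs run, a small polling loop with the NVML bindings (the pynvml package, not part of the original repro) can report utilization:

# Optional helper, assuming pynvml is installed; prints GPU utilization
# once per second so you can confirm the ~80%+ load.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # the single T4 on g4dn.2xlarge
for _ in range(30):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
    print(f"GPU utilization: {util}%")
    time.sleep(1)
pynvml.nvmlShutdown()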

In the above setup, outputs 1-4 are from the resnet18 model and outputs 5-8 are from the inceptionv3 model. Since I passed the same inputs, each group should contain identical results. However, when comparing them you should see significant discrepancies:

import pickle
import numpy as np

f1 = pickle.load(open('/home/ubuntu/outputs1.p','rb'))
f2 = pickle.load(open('/home/ubuntu/outputs2.p','rb'))
f3 = pickle.load(open('/home/ubuntu/outputs3.p','rb'))
f4 = pickle.load(open('/home/ubuntu/outputs4.p','rb'))

f5 = pickle.load(open('/home/ubuntu/outputs5.p','rb'))
f6 = pickle.load(open('/home/ubuntu/outputs6.p','rb'))
f7 = pickle.load(open('/home/ubuntu/outputs7.p','rb'))
f8 = pickle.load(open('/home/ubuntu/outputs8.p','rb'))

# maximum absolute difference between runs of the same model on the same inputs
for entry in [f2, f3, f4]:
    print(np.max(np.abs(f1 - entry)))
for entry in [f7, f6, f8]:
    print(np.max(np.abs(f5 - entry)))

[Screenshot from the original issue: the printed maximum absolute differences, showing large errors]

You can certainly generate the input data and the TensorRT models yourself: the input data is just a randomly generated array of size 64x3x224x224, and the two models come directly from torchvision.


import torch
import torchvision
from torch2trt import torch2trt

# Convert torchvision resnet18 to a TensorRT engine with torch2trt and
# serialize it into the model repository. fp16_mode matches the .half()
# weights; max_batch_size is sized for the 64-sample batches used above
# (the linked repo is authoritative for the exact build flags).
model = torchvision.models.resnet18(pretrained=True).cuda().half().eval()
data = torch.randn((1, 3, 224, 224)).cuda().half()
model_trt = torch2trt(model, [data], fp16_mode=True, max_batch_size=64)
with open('/workspace/ubuntu/model_repository_2011/resnet18/1/model', "wb") as f:
    f.write(model_trt.engine.serialize())

# Same for inceptionv3.
model = torchvision.models.inception_v3(pretrained=True).cuda().half().eval()
data = torch.randn((1, 3, 224, 224)).cuda().half()
model_trt = torch2trt(model, [data], fp16_mode=True, max_batch_size=64)
with open('/workspace/ubuntu/model_repository_2011/inceptionv3/1/model', "wb") as f:
    f.write(model_trt.engine.serialize())

Expected behavior

I think we should expect the same model (which is, of course, supposed to give deterministic results) to return correct and consistent outputs for the same inputs.

What we see in our actual workloads is that the discrepancies are significantly large: the more models we host, the larger the magnitude of the errors. Without Triton Server, we can confirm that these models produce consistent results when only one model runs on the GPU.
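
A minimal sketch of such a standalone check (assuming the model_trt module from the conversion snippet above, built with a max batch size of at least 64; the exact script is not shown here):

# Run the converted TensorRT module repeatedly on one fixed batch, with no
# Triton involved, and compare each output against the first run.
import torch

batch = torch.randn((64, 3, 224, 224)).cuda().half()
with torch.no_grad():
    reference = model_trt(batch)
    for _ in range(10):
        print((model_trt(batch) - reference).abs().max().item())  # stays at or near zero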

Let me know if you need additional information.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 4
  • Comments: 18 (14 by maintainers)

Top GitHub Comments

CoderHam commented on Feb 4, 2021 (2 reactions)

@huntrax11 you should try with the traced model instead of the scripted one and confirm whether you see the same issue. Sometimes the scripted model does not perform as expected. Additionally, check with the 20.12 release container.
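
For reference, a minimal sketch of exporting a traced model (as opposed to a scripted one) for the same resnet18 used above; the file name and location depend on the backend's model configuration:

# torch.jit.trace records the operations executed for the example input,
# whereas torch.jit.script compiles the model's Python source.
import torch
import torchvision

model = torchvision.models.resnet18(pretrained=True).cuda().half().eval()
example = torch.randn((1, 3, 224, 224)).cuda().half()

traced = torch.jit.trace(model, example)
traced.save("model.pt")  # place under <model_repository>/<model>/1/ as the backend expects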

CoderHam commented on Mar 12, 2021 (1 reaction)

@huntrax11 please repeat your experiments with the latest release and let me know if you still see this issue.
