Performance on Triton with Python backend
See original GitHub issue
Description
Hello,
- From the documentation I was expecting Triton to be faster than TorchServe. But in the images below I see that, with the same model, the throughput of Triton is lower than the throughput of TorchServe, even though I used a TensorRT model in Triton (TensorRT model inference is faster than PyTorch model inference).
- Could I have done something wrong somewhere?
Triton Information
Version: Triton 21.07
To Reproduce
This is what I did:
- In the model repository, using the Python backend: I added a model to the model repository (creating the model.py file, adding the weights of the TensorRT-trained model there, creating the config.pbtxt file, building the stub and conda-packing my environment, …).
- In model.py, in the TritonPythonModel class (see the sketch below):
  - I load the pretrained model in the initialize function.
  - In the execute function, I implement my model inference and convert the output to the format expected for the Triton response.
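For context, here is a minimal sketch of what such a model.py can look like. This is an illustration under assumptions, not the issue author's actual file: the tensor names INPUT0/OUTPUT0, the FP32 dtype, and the engine-loading helpers are hypothetical placeholders that would have to match the real config.pbtxt and TensorRT engine.

```python
# model.py — minimal Python-backend sketch (illustrative only).
# "INPUT0"/"OUTPUT0" and the engine helpers are hypothetical placeholders and
# must match the names declared in config.pbtxt and the actual model.
import json

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Called once when Triton loads the model instance.
        # args["model_config"] is the config.pbtxt content serialized as JSON.
        self.model_config = json.loads(args["model_config"])
        # Hypothetical: load the pretrained TensorRT engine here.
        # self.engine = load_engine("model.plan")

    def execute(self, requests):
        # Called for each batch of requests; must return one response per request.
        responses = []
        for request in requests:
            input0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            data = input0.as_numpy()

            # Hypothetical inference call; identity used here as a placeholder.
            # result = run_engine(self.engine, data)
            result = data

            output0 = pb_utils.Tensor("OUTPUT0", result.astype(np.float32))
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[output0])
            )
        return responses

    def finalize(self):
        # Optional cleanup when the model is unloaded.
        pass
```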
Expected behavior
Am I doing this right? If I'm wrong, can you point me in the right direction?
Issue Analytics
- State:
- Created 2 years ago
- Comments: 6 (3 by maintainers)
Top Results From Across the Web
- triton python backend load time of pytorch model is 4x slower ...
  When I deploy the model using python backend, the loading time is around 0.8 seconds. However, if I load the onnx model without...
- 1. Introduction — Poplar Triton Backend - Graphcore Documents
  1.7.1. Triton performance analyzer and metrics ... For each batch of requests the Poplar backend will provide compute and execution times...
- Solving AI Inference Challenges with NVIDIA Triton
  Using the Python or C++ backends, you can define a custom script that can call any other model being served by Triton based...
- Use Triton Inference Server with Amazon SageMaker
  The Triton Python backend uses shared memory (SHMEM) to connect your code to Triton. SageMaker Inference provides up to half of the instance...
- High-performance serving with Triton Inference Server (Preview)
  Triton is multi-framework, open-source software that is optimized for inference. It supports popular machine learning frameworks like TensorFlow ...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@gioipv Could you please share your model.py file? From your overall description, it seems like you are doing it correctly. What inferencing solution does the “model inference” column use? Is it using the PyTorch backend? How did you measure the performance of your Python models? Did you use Perf Analyzer?
@gioipv Thanks for sharing the results.
The Triton Python client may add some latency because the Python API can be slower. Also, if you want to try concurrency values higher than 1, it would be harder to create the same scenario using the Python client.
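For comparison, below is a minimal sketch of a single synchronous request through the Triton Python gRPC client, which is the kind of client-side measurement the comment above refers to. The model name my_model, the INPUT0/OUTPUT0 tensor names, and the input shape are assumptions for illustration, not taken from the issue; they must match the deployed config.pbtxt.

```python
# Minimal synchronous gRPC request via the Triton Python client (illustrative).
# Model name, tensor names, and shape are hypothetical placeholders.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

data = np.random.rand(1, 3, 224, 224).astype(np.float32)
inp = grpcclient.InferInput("INPUT0", list(data.shape), "FP32")
inp.set_data_from_numpy(data)
out = grpcclient.InferRequestedOutput("OUTPUT0")

# Each call here is one request at concurrency 1, with Python client overhead
# included in the measurement. For throughput numbers at higher concurrency,
# the Perf Analyzer CLI (e.g. `perf_analyzer -m my_model --concurrency-range 1:8`)
# is the usual approach.
result = client.infer(model_name="my_model", inputs=[inp], outputs=[out])
print(result.as_numpy("OUTPUT0").shape)
```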