Triton server is slower than PyTorch model
Description
I have converted the model to TorchScript format and call Triton server using the gRPC client. The model synthesizes successfully, but inference through Triton is very slow: for a single short sentence of 4 words, the PyTorch model takes less than 0.1 seconds, while Triton server takes about 1.5 seconds.
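For context, a minimal sketch of how such a gRPC call and its timing could look, assuming the standard tritonclient Python package; the model name, input names, and dtypes follow the config posted below, while the token IDs and server URL are placeholders:

import time
import numpy as np
import tritonclient.grpc as grpcclient

# Placeholder token IDs for one short sentence; real IDs come from the
# FastPitch text preprocessing.
token_ids = np.array([[12, 57, 3, 41, 8, 20]], dtype=np.int64)  # shape [1, seq_len]
token_len = np.array([[token_ids.shape[1]]], dtype=np.int64)    # shape [1, 1]

client = grpcclient.InferenceServerClient(url="localhost:8001")  # default gRPC port

inputs = [
    grpcclient.InferInput("INPUT__0", list(token_ids.shape), "INT64"),
    grpcclient.InferInput("INPUT__1", list(token_len.shape), "INT64"),
]
inputs[0].set_data_from_numpy(token_ids)
inputs[1].set_data_from_numpy(token_len)
outputs = [grpcclient.InferRequestedOutput("OUTPUT__0")]

start = time.perf_counter()
result = client.infer(model_name="FastPitch", inputs=inputs, outputs=outputs)
print("Triton round-trip: %.3f s" % (time.perf_counter() - start))
mel = result.as_numpy("OUTPUT__0")  # expected shape [1, 80, T]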
Triton Information
What version of Triton are you using? r21.04 (server version 2.7.0), CUDA 11.2, GPU: RTX 2080 Ti.
To Reproduce
Steps to reproduce the behavior.
Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).
name: "FastPitch"
platform: "pytorch_libtorch"
default_model_filename: "model.pt"
max_batch_size: 8
input {
  name: "INPUT__0"
  data_type: TYPE_INT64
  dims: -1
}
input {
  name: "INPUT__1"
  data_type: TYPE_INT64
  dims: 1
}
output {
  name: "OUTPUT__0"
  data_type: TYPE_FP16
  dims: 80
  dims: -1
}
output {
  name: "OUTPUT__1"
  data_type: TYPE_INT64
  dims: 1
  reshape {
  }
}
output {
  name: "OUTPUT__2"
  data_type: TYPE_FP16
  dims: -1
}
output {
  name: "OUTPUT__3"
  data_type: TYPE_FP16
  dims: -1
}
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
}
instance_group {
  count: 1
  gpus: 0
  kind: KIND_GPU
}
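For the PyTorch-side figure quoted in the description, the baseline can be measured roughly as in the sketch below (not the exact script used; the file name is assumed to be the same TorchScript file served by Triton, the two int64 inputs mirror INPUT__0/INPUT__1 above, and torch.cuda.synchronize() ensures the asynchronous GPU work is included in the timing):

import time
import torch

# Assumption: the same TorchScript file that Triton serves.
model = torch.jit.load("model.pt").cuda().eval()

# Illustrative inputs matching the config: [1, seq_len] token IDs and [1, 1] length.
token_ids = torch.randint(0, 100, (1, 6), dtype=torch.int64, device="cuda")
token_len = torch.tensor([[6]], dtype=torch.int64, device="cuda")

with torch.no_grad():
    model(token_ids, token_len)   # warm-up so TorchScript JIT optimization is not timed
    torch.cuda.synchronize()

    start = time.perf_counter()
    model(token_ids, token_len)
    torch.cuda.synchronize()
    print("Direct TorchScript: %.3f s" % (time.perf_counter() - start))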
Triton server's verbose log:
I0908 08:50:40.641507 2516 libtorch.cc:1095] model FastPitch, instance FastPitch_0, executing 1 requests
I0908 08:50:40.641566 2516 libtorch.cc:504] TRITONBACKEND_ModelExecute: Running FastPitch_0 with 1 requests
I0908 08:50:42.023955 2516 infer_response.cc:165] add response output: output: OUTPUT__0, type: FP16, shape: [1,80,81]
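The two timestamps above already show that roughly 1.4 s elapses inside the libtorch backend's execute call rather than in the client or network. Triton's per-model statistics can confirm how that time splits between queueing and compute; a sketch using the same tritonclient package (field names assumed to follow the statistics protobuf shipped with the r21.xx clients):

import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")  # default gRPC port

# Cumulative per-model statistics since server start; ns / count gives the
# average time per request spent in each stage.
stats = client.get_inference_statistics(model_name="FastPitch")
infer_stats = stats.model_stats[0].inference_stats
count = infer_stats.success.count
for stage in ("queue", "compute_input", "compute_infer", "compute_output"):
    ns = getattr(infer_stats, stage).ns
    if count:
        print("%s: %.3f ms/request" % (stage, ns / count / 1e6))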
Expected behavior
Inference latency through Triton server should be comparable to running the TorchScript model directly in PyTorch. I have tried many times but cannot fix it.
Top GitHub Comments
Sorry for the late reply, I will recalculate the result.
@guyqaz closing due to inactivity. I can re-open once you provide us with additional reproduction steps for the issue.