
Python backend on CPU is slower when serving a PyTorch model

See original GitHub issue

Description

I have a Python model that uses a pre-trained RoBERTa model for inference. I have added this model to Triton to serve it with the Python backend. We also have the exact same Python code/model being served by a FastAPI application. Both are running on hardware with the same specs. When I compared the two in terms of CPU performance, the latency with Triton is much higher. I used the PyTorch profiler to debug what is causing the higher latencies with Triton. The screenshots below show the profiler outputs.

Triton-CPU: (PyTorch profiler output screenshot: triton-cpu)

FastAPI-CPU: (PyTorch profiler output screenshot: api-cpu)

Based on the screenshots, native_layer_norm in particular takes significantly longer with Triton than with the model running in our FastAPI application. native_layer_norm is part of the pre-trained RoBERTa model.
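
For context, a minimal sketch of how a per-operator breakdown like the one above can be collected with the PyTorch profiler. The checkpoint name and input text are placeholders, and the torch.autograd.profiler API shown matches the torch 1.6 used here (newer releases expose torch.profiler instead):

import torch
from transformers import RobertaModel, RobertaTokenizer

# Placeholder checkpoint; the issue does not say which RoBERTa weights are used.
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")
model.eval()

inputs = tokenizer("sample input text", return_tensors="pt")

# torch 1.6-era autograd profiler, recording input shapes per op.
with torch.no_grad():
    with torch.autograd.profiler.profile(record_shapes=True) as prof:
        model(**inputs)

# Sort by total CPU time to surface operators such as native_layer_norm.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))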

Triton Information

What version of Triton are you using? Version: 21.07

Are you using the Triton container or did you build it yourself? I built the image myself based on r21.07, but I have also tested serving the model with the official Triton containers (r21.07 and r21.08) and the issue remains the same.

To Reproduce

Steps to reproduce the behavior.

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).

Dependencies: torch==1.6.0 transformers==3.5.1
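
For reference, a minimal sketch of what the Python backend model.py for this setup might look like. The checkpoint, tokenization, and output payload are assumptions for illustration, not the reporter's actual code:

import json
import numpy as np
import torch
import triton_python_backend_utils as pb_utils
from transformers import RobertaModel, RobertaTokenizer


class TritonPythonModel:
    def initialize(self, args):
        # Placeholder checkpoint; the real weights are not specified in the issue.
        self.tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
        self.model = RobertaModel.from_pretrained("roberta-base")
        self.model.eval()

    def execute(self, requests):
        responses = []
        for request in requests:
            in_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            # TYPE_STRING inputs arrive as an object array of bytes.
            texts = [t.decode("utf-8") for t in in_tensor.as_numpy().reshape(-1)]
            with torch.no_grad():
                encoded = self.tokenizer(texts, return_tensors="pt", padding=True)
                _ = self.model(**encoded)
            # Output content is a placeholder; the real model returns something
            # derived from the forward pass.
            out = np.array([json.dumps({"ok": True}).encode("utf-8")] * len(texts),
                           dtype=object).reshape(-1, 1)
            responses.append(pb_utils.InferenceResponse(
                output_tensors=[pb_utils.Tensor("OUTPUT0", out)]))
        return responses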

config.pbtxt

name: "sample-model"
backend: "python"
max_batch_size: 8

input [
  {
    name: "INPUT0"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]

output [
  {
    name: "OUTPUT0"
    data_type: TYPE_STRING
    dims: [1]
  }
]

parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "<path to execution env>"}
}

instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]
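
For completeness, a hedged sketch of how a client could call this configuration over HTTP with the tritonclient package; the server URL and input text are assumptions:

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Shape is [batch, 1]: max_batch_size is 8 and dims is [1] in config.pbtxt.
text = np.array([["sample input text"]], dtype=object)
inp = httpclient.InferInput("INPUT0", text.shape, "BYTES")
inp.set_data_from_numpy(text)

result = client.infer(model_name="sample-model", inputs=[inp])
print(result.as_numpy("OUTPUT0"))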

Expected behavior

Ideally, the performance should be similar when the same model is run with Triton.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 29 (16 by maintainers)

Top GitHub Comments

3 reactions
SaratM34 commented, Jan 20, 2022

@tanmayv25 In my initial testing the results look good. The performance is greatly improved. Below is a summary from the initial testing.

Before Fix

Inferences/Second vs. Client Average Batch Latency
Concurrency: 5, throughput: 3.05 infer/sec, latency 1655480 usec
Concurrency: 10, throughput: 3.17 infer/sec, latency 3153540 usec
Concurrency: 15, throughput: 3.21 infer/sec, latency 4687196 usec

After Fix

Inferences/Second vs. Client Average Batch Latency
Concurrency: 5, throughput: 17.72 infer/sec, latency 282101 usec
Concurrency: 10, throughput: 17.75 infer/sec, latency 562845 usec
Concurrency: 15, throughput: 17.95 infer/sec, latency 836044 usec
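
The format of these numbers matches Triton's perf_analyzer summary; a sweep like the one above could be reproduced with an invocation along these lines (the flags and the input string are assumptions based on the config.pbtxt shown earlier):

perf_analyzer -m sample-model --concurrency-range 5:15:5 --string-data "sample input text"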

I have some more testing pending. I will update here once I am done with the complete testing.

1 reaction
zhaohb commented, Jan 27, 2022

@tanmayv25 ok, thank you very much.
