Poor performance on Triton vs. inferencing without Triton
Description
Using an EfficientNet-B7 model converted to a half-precision TensorRT engine. Running the TensorRT engine directly, 1 image takes ~20 ms and 8 images take ~160 ms.
When the same model is loaded via Triton, 1 image takes ~50 ms and 8 images take ~330 ms. I am sending requests to and from the same physical machine. I understand there is some overhead from HTTP, but this seems excessive?
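For reference, a request on the Triton side might look roughly like the following (a minimal sketch using the tritonclient HTTP API; the model name, tensor names, shape, and dtype are assumptions, since they are not given in the issue):

import numpy as np
import tritonclient.http as httpclient

# Assumed names and shapes, for illustration only
MODEL_NAME = "efficientnet_b7"
INPUT_NAME = "input"
OUTPUT_NAME = "output"

client = httpclient.InferenceServerClient(url="localhost:8000")

# One batch of 8 FP16 images in NCHW layout (600x600 is assumed for B7)
batch = np.random.rand(8, 3, 600, 600).astype(np.float16)

infer_input = httpclient.InferInput(INPUT_NAME, list(batch.shape), "FP16")
infer_input.set_data_from_numpy(batch, binary_data=True)
infer_output = httpclient.InferRequestedOutput(OUTPUT_NAME, binary_data=True)

result = client.infer(MODEL_NAME, inputs=[infer_input], outputs=[infer_output])
predictions = result.as_numpy(OUTPUT_NAME)

Using binary_data=True sends the tensors as raw bytes appended to the JSON request rather than as JSON-encoded arrays, which matters for inputs this large.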
My inference code for the standalone TensorRT path (using Polygraphy's TrtRunner):
from polygraphy.backend.trt import TrtRunner

def inference_rt(engine, data, input_name):
    # Wrap the deserialized TensorRT engine in a Polygraphy runner
    runner = TrtRunner(engine)
    with runner:
        # Bind the input tensor by name and run a single inference pass
        outputs = runner.infer(feed_dict={input_name: data})
    return outputs
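For completeness, a minimal sketch of how the engine might be deserialized and this function called, assuming a serialized engine file on disk (the file name, input name, and batch shape are hypothetical):

import numpy as np
from polygraphy.backend.common import BytesFromPath
from polygraphy.backend.trt import EngineFromBytes

# Deserialize a previously built half-precision engine (the path is an assumption)
engine = EngineFromBytes(BytesFromPath("efficientnet_b7_fp16.plan"))()

batch = np.random.rand(8, 3, 600, 600).astype(np.float16)  # shape/dtype assumed
outputs = inference_rt(engine, batch, "input")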
Any performance tips or things I'm missing? Is this typical? Thanks.
Triton Information
22.03
Are you using the Triton container or did you build it yourself? Using the 22.03 container.
Issue Analytics
- Created a year ago
- Comments: 8 (6 by maintainers)
Top GitHub Comments
Using the perf analyzer, the time spent on the inference itself looks close to the standalone numbers, and I am on a slower GPU; it is around 20 ms per inference, which seems correct. I'm still disappointed by how much overhead and latency there is in moving data between models and in making and receiving requests.
Any tips on reducing that? The time spent not inferencing is several times greater than the time spent inferencing.
This is intended for remote inferencing, so I don't think shared memory is an option.
Thank you again for all your support.
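For context, the perf_analyzer run mentioned above might look something like this (a sketch only; the model name and batch size are assumptions):

perf_analyzer -m efficientnet_b7 -b 8 --concurrency-range 1:4 --percentile=95

perf_analyzer reports the server-side compute time separately from the client-side round-trip latency, which is how the request/transfer overhead can be separated from the ~20 ms of actual inference.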
Closing this issue due to lack of activity. Please re-open it if you would like to follow up.