
Diagnosing very slow performance


šŸ› Describe the bug

I'm trying to work out why my endpoint throughput is very slow. I wasn't sure if this is the best forum, but there doesn't appear to be a specific torchserve forum on https://discuss.pytorch.org/

I have a simple text classifier, and I've created a custom handler as the default wasn't suitable. I tested the handler by creating a harness based on https://github.com/frank-dong-ms/torchserve-performance/blob/main/test_models_windows.py - I also added custom timer metrics to my preprocess/inference/postprocess methods.

The result is that most of the handle time is spent in inference, and my model performs as expected. It processes a batch of 1 text in about 40ms and a batch of 128 in 80ms - so clearly, to get good throughput, I need larger batches.
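
To make the batching requirement concrete, here is the rough arithmetic implied by those measurements (a sketch only, using the latencies quoted above):

  # Implied model throughput from the measured latencies above (rough figures)
  batch_1_latency_s = 0.040    # ~40 ms for a batch of 1
  batch_128_latency_s = 0.080  # ~80 ms for a batch of 128

  print(1 / batch_1_latency_s)      # ~25 examples/sec at batch size 1
  print(128 / batch_128_latency_s)  # ~1600 examples/sec at batch size 128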

The throughput of a basic script passing batches of 128 to the model is about 2000 examples per second, but torchserve only achieves 30-60 examples per second.

I'm fairly sure the bottleneck is not in the handler; the model log seems to imply it's not receiving requests quickly enough. I would hope that it could accumulate a batch of 128 within the maxBatchDelay=50 window whilst the model is processing the previous batch, but in fact it only manages a handful. I've attached my model log below.

My first question is: what does the message "Backend received inference at: 1669930796" mean? Specifically, is the number a timestamp, and if so, why is the same value repeated many times, given that the size of the batches being passed to the handler is well below the batch size of 128 set in the model config?

Second, how do I stream data faster to the endpoint? Our use case is to make many requests in succession. I've tried client-side batching, and that does increase throughput slightly, but it's still extremely slow.

My test code is based on an example, and I've also tried curl with the -P option and the time command. Throughput is orders of magnitude slower than a simple script running inference in a loop.

  from concurrent.futures import as_completed
  import time

  from requests_futures.sessions import FuturesSession

  api = "http://localhost:8080/predictions/text_classifier"
  headers = {"Content-type": "application/json", "Accept": "text/plain"}

  session = FuturesSession()

  # texts is a list of short (< 128 character) strings defined elsewhere
  start_time = time.time()
  futures = []
  for text in texts:
      futures.append(session.post(api, data=text))

  # Collect the responses as the futures complete
  for future in as_completed(futures):
      result = future.result().content.decode("utf-8")

  total_time = int((time.time() - start_time) * 1e3)

  print("total time in ms:", total_time)
  throughput = len(texts) / total_time * 1e3
  print("throughput:", throughput)

I'm going to look at gRPC, as that is probably a better match for our use case (I think), but I feel I'm doing something wrong, or there's an issue somewhere. In particular, the number of requests per second that the frontend is receiving/handling appears to be far lower than I expected - the payload per request is a string of fewer than 128 characters.
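
For what it's worth, the kind of gRPC client I'm planning to try looks roughly like the sketch below, adapted from the gRPC client example in the TorchServe repo. It's untested and assumes the inference_pb2/inference_pb2_grpc stubs have been generated from TorchServe's proto files and that the default gRPC inference port (7070) is in use:

  import grpc

  # Assumed: stubs generated from TorchServe's inference.proto via grpcio-tools
  import inference_pb2
  import inference_pb2_grpc

  def predict(text):
      # 7070 is TorchServe's default gRPC inference port
      with grpc.insecure_channel("localhost:7070") as channel:
          stub = inference_pb2_grpc.InferenceAPIsServiceStub(channel)
          response = stub.Predictions(
              inference_pb2.PredictionsRequest(
                  model_name="text_classifier",
                  input={"data": text.encode("utf-8")},
              )
          )
          return response.prediction.decode("utf-8")

  print(predict("example input text"))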

Error logs

model_log.log looks like

2022-12-02T08:39:55,921 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930795
2022-12-02T08:39:55,925 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 8 text
2022-12-02T08:39:56,015 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,016 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 4 text
2022-12-02T08:39:56,079 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,084 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 5 text
2022-12-02T08:39:56,147 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,149 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 7 text
2022-12-02T08:39:56,214 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,215 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 3 text
2022-12-02T08:39:56,279 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,281 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 6 text
2022-12-02T08:39:56,345 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,346 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 5 text
2022-12-02T08:39:56,407 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,409 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 6 text
2022-12-02T08:39:56,472 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,473 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 3 text
2022-12-02T08:39:56,534 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,536 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 8 text
2022-12-02T08:39:56,641 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,642 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 7 text
2022-12-02T08:39:56,707 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,709 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 3 text
2022-12-02T08:39:56,775 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,776 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 6 text
2022-12-02T08:39:56,842 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,844 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 5 text
2022-12-02T08:39:56,953 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,955 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 8 text
2022-12-02T08:39:57,056 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930797
2022-12-02T08:39:57,057 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 6 text
2022-12-02T08:39:57,128 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930797
2022-12-02T08:39:57,129 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 3 text

Installation instructions

Ran install_dependencies.py and pip install torchserve, torch-model-archiver, torch-workflow-archiver

Model Packaging

N/A

config.properties

model_store=/home/dave/dev/text-classifier/model_server/model_store2
load_models=all
number_of_netty_threads=8
netty_client_threads=8
default_workers_per_model=8
job_queue_size=1000
models = {\
    "text_classifier": {\
      "1.0.0": {\
        "defaultVersion": true,\
        "marName": "text_classifier.mar",\
        "minWorkers": 1,\
        "maxWorkers": 8,\
        "batchSize": 128,\
        "maxBatchDelay": 50,\
        "responseTimeout": 120\
      }\
    }\
  }

Versions

torchserve==0.6.1 torch-model-archiver==0.6.1

Python version: 3.8 (64-bit runtime) Python executable: /home/dave/dev/text-classifier/.venv/bin/python

Versions of relevant python libraries: captum==0.5.0, future==0.18.2, numpy==1.23.5, nvgpu==0.9.0, psutil==5.9.4, pytest==7.2.0, requests==2.28.1, requests-futures==1.0.0, torch==1.13.0+cu117, torch-model-archiver==0.6.1, torch-workflow-archiver==0.2.5, torchaudio==0.13.0+cu117, torchdata==0.5.0, torchserve==0.6.1, torchtext==0.14.0, torchvision==0.14.0+cu117, wheel==0.38.4

Java Version:

OS: Ubuntu 20.04.5 LTS GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 Clang version: N/A CMake version: version 3.24.2

Is CUDA available: Yes CUDA runtime version: 11.3.109 GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090 GPU 1: NVIDIA GeForce RTX 3090 Nvidia driver version: 520.61.05 cuDNN version: Probably one of the following: /usr/lib/x86_64-linux-gnu/libcudnn.so.8.7.0 /usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.7.0 /usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.7.0 /usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.7.0 /usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.7.0 /usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.7.0 /usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.7.0

Repro instructions

N/A

Possible Solution

No response

Issue Analytics

  • State: closed
  • Created: 10 months ago
  • Comments: 8

Top GitHub Comments

1 reaction
lxning commented, Dec 2, 2022

@david-waterworth you can directly use the torchserve benchmark tools: benchmarking-with-apache-bench or auto-benchmarking-with-apache-bench.
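
For reference, a minimal apache-bench run against the endpoint above might look something like this (a sketch only; payload.txt is a hypothetical file containing a single request body):

  ab -n 1000 -c 10 -T "application/json" -p payload.txt http://localhost:8080/predictions/text_classifier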

0 reactions
david-waterworth commented, Dec 2, 2022

Thanks @lxning! I have managed to get better batching by adapting this example. Now that I have client code in Python that's fully async, I'm seeing the batching work as I expect (rough sketch of the client below).

I do wonder whether, instead of a fixed maxBatchDelay, it might be better to have a maximum inter-request delay (or perhaps both). That way, if single requests are arriving they can be processed within the inter-request delay, but if there is a burst then it batches.
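
The async client I ended up with looks roughly like this (a simplified sketch; I'm assuming aiohttp here, and texts is the same list of short strings as in the original test):

  import asyncio
  import time

  import aiohttp  # assumed async HTTP client; any async client should behave similarly

  API = "http://localhost:8080/predictions/text_classifier"

  async def predict(session, text):
      async with session.post(API, data=text) as response:
          return await response.text()

  async def main(texts):
      async with aiohttp.ClientSession() as session:
          start = time.time()
          # Fire all requests concurrently so the frontend can fill larger batches
          results = await asyncio.gather(*(predict(session, t) for t in texts))
          elapsed = time.time() - start
          print("throughput:", len(texts) / elapsed, "examples/sec")
          return results

  # asyncio.run(main(texts))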
