Diagnosing very slow performance
Describe the bug
I'm trying to work out why my endpoint throughput is very slow. I wasn't sure if this is the best forum, but there doesn't appear to be a specific TorchServe category on https://discuss.pytorch.org/
I have a simple text classifier, and I've created a custom handler because the default wasn't suitable. I tested the handler by creating a harness based on https://github.com/frank-dong-ms/torchserve-performance/blob/main/test_models_windows.py, and I also added custom timer metrics to my preprocess/inference/postprocess methods.
The result is that most of the handle time is spent in inference, and my model performs as expected: it processes a batch of 1 text in about 40 ms and a batch of 128 in about 80 ms, so clearly, to get good throughput, I need larger batches.
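(In other words, roughly 1/0.04 ≈ 25 examples/s at batch size 1 versus 128/0.08 = 1600 examples/s at batch size 128, counting handler time only.)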
The throughput of a basic script passing batches of 128 to the model is about 2000 examples per second, but TorchServe only achieves 30-60 examples per second.
I'm fairly sure the bottleneck is not in the handler; the model log seems to imply the handler is not receiving requests quickly enough. I would hope that the frontend could accumulate a batch of 128 within maxBatchDelay=50 whilst the model is processing the previous batch, but in fact it only manages a handful. I've attached my model log below.
My first question is: what does the message Backend received inference at: 1669930796 mean? Specifically, is the number a timestamp, and if so, why is the same value repeated many times, given that the batches being passed to the handler are well below the batch size of 128 set in the model config?
Second, how do I stream data to the endpoint faster? Our use case is to make many requests in succession. I've tried client batching, and that does increase throughput slightly, but it's still extremely slow.
My test code is based on an example, and I've also tried curl with the -P option and the time command. Throughput is orders of magnitude slower than a simple script running inference in a loop.
import requests
import json
import time
from requests_futures.sessions import FuturesSession
from concurrent.futures import as_completed

api = "http://localhost:8080/predictions/text_classifier"
headers = {"Content-type": "application/json", "Accept": "text/plain"}

# texts is a list of short input strings (< 128 characters each), defined elsewhere
session = FuturesSession()
start_time = time.time()

# Fire off all requests without blocking on the responses
futures = []
for text in texts:
    futures.append(session.post(api, data=text))

# Collect responses as they complete
for future in as_completed(futures):
    result = future.result().content.decode("utf-8")

total_time = int((time.time() - start_time) * 1e3)
print("total time in ms:", total_time)

throughput = len(texts) / total_time * 1e3
print("throughput:", throughput)
I'm going to look at gRPC as that is probably a better match for our use case (I think), but I feel I'm doing something wrong, or there's an issue somewhere. In particular, the number of requests per second that the front end is receiving/handling appears to be far lower than I expected; the payload per request is a string of < 128 characters.
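For reference, a minimal sketch of a gRPC request to TorchServe might look like the code below. It assumes the Python stubs (inference_pb2, inference_pb2_grpc) have been generated from TorchServe's proto files and that the gRPC inference API is listening on the default port 7070; the port and field names are assumptions, not something verified in this issue.

import grpc
import inference_pb2        # generated from TorchServe's inference.proto (assumption)
import inference_pb2_grpc   # generated from TorchServe's inference.proto (assumption)

channel = grpc.insecure_channel("localhost:7070")
stub = inference_pb2_grpc.InferenceAPIsServiceStub(channel)

# Send a single prediction request for the text_classifier model
request = inference_pb2.PredictionsRequest(
    model_name="text_classifier",
    input={"data": "some example text".encode("utf-8")},
)
response = stub.Predictions(request)
print(response.prediction.decode("utf-8"))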
Error logs
model_log.log looks like
2022-12-02T08:39:55,921 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930795
2022-12-02T08:39:55,925 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 8 text
2022-12-02T08:39:56,015 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,016 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 4 text
2022-12-02T08:39:56,079 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,084 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 5 text
2022-12-02T08:39:56,147 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,149 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 7 text
2022-12-02T08:39:56,214 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,215 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 3 text
2022-12-02T08:39:56,279 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,281 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 6 text
2022-12-02T08:39:56,345 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,346 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 5 text
2022-12-02T08:39:56,407 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,409 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 6 text
2022-12-02T08:39:56,472 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,473 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 3 text
2022-12-02T08:39:56,534 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,536 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 8 text
2022-12-02T08:39:56,641 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,642 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 7 text
2022-12-02T08:39:56,707 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,709 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 3 text
2022-12-02T08:39:56,775 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,776 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 6 text
2022-12-02T08:39:56,842 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,844 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 5 text
2022-12-02T08:39:56,953 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,955 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 8 text
2022-12-02T08:39:57,056 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930797
2022-12-02T08:39:57,057 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 6 text
2022-12-02T08:39:57,128 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930797
2022-12-02T08:39:57,129 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 3 text
Installation instructions
Ran install_dependencies.py and pip install torchserve, torch-model-archiver, torch-workflow-archiver
Model Packaging
N/A
config.properties
model_store=/home/dave/dev/text-classifier/model_server/model_store2
load_models=all
number_of_netty_threads=8
netty_client_threads=8
default_workers_per_model=8
job_queue_size=1000
models = {\
"text_classifier": {\
"1.0.0": {\
"defaultVersion": true,\
"marName": "text_classifier.mar",\
"minWorkers": 1,\
"maxWorkers": 8,\
"batchSize": 128,\
"maxBatchDelay": 50,\
"responseTimeout": 120\
}\
}\
}
Versions
torchserve==0.6.1 torch-model-archiver==0.6.1
Python version: 3.8 (64-bit runtime) Python executable: /home/dave/dev/text-classifier/.venv/bin/python
Versions of relevant python libraries: captum==0.5.0 future==0.18.2 numpy==1.23.5 nvgpu==0.9.0 psutil==5.9.4 pytest==7.2.0 requests==2.28.1 requests-futures==1.0.0 torch==1.13.0+cu117 torch-model-archiver==0.6.1 torch-workflow-archiver==0.2.5 torchaudio==0.13.0+cu117 torchdata==0.5.0 torchserve==0.6.1 torchtext==0.14.0 torchvision==0.14.0+cu117 wheel==0.38.4
Java Version:
OS: Ubuntu 20.04.5 LTS GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 Clang version: N/A CMake version: version 3.24.2
Is CUDA available: Yes CUDA runtime version: 11.3.109 GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090 GPU 1: NVIDIA GeForce RTX 3090 Nvidia driver version: 520.61.05 cuDNN version: Probably one of the following: /usr/lib/x86_64-linux-gnu/libcudnn.so.8.7.0 /usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.7.0 /usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.7.0 /usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.7.0 /usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.7.0 /usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.7.0 /usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.7.0
Repro instructions
N/A
Possible Solution
No response
Top GitHub Comments
@david-waterworth you can directly use the TorchServe benchmark tools: benchmarking-with-apache-bench or auto-benchmarking-with-apache-bench.
Thanks @lxning! I have managed to get better batching by adapting this example. Now that I have client code in Python that's fully async, I'm seeing the batching work as I expect.
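The comment doesn't include the code, but a fully async client along these lines (using aiohttp, which is an assumption) keeps many requests in flight so the frontend can fill batches within maxBatchDelay:

import asyncio
import time
import aiohttp

API = "http://localhost:8080/predictions/text_classifier"


async def predict(session, text):
    # POST one text and return the decoded prediction
    async with session.post(API, data=text) as response:
        return await response.text()


async def main(texts):
    start = time.time()
    async with aiohttp.ClientSession() as session:
        # All requests are in flight concurrently, so the server can batch them
        results = await asyncio.gather(*(predict(session, t) for t in texts))
    elapsed = time.time() - start
    print(f"throughput: {len(results) / elapsed:.1f} examples/s")


if __name__ == "__main__":
    texts = ["example text"] * 1000  # placeholder payloads
    asyncio.run(main(texts))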
I do wonder if instead of a fixed maxBatchDelay it might be better to have a maximum inter-request delay (or maybe as well as). That way if single requests are arriving they can be processed within the inter-request delay but if there is a burst then it batches.