Diagnosing very slow performance
Describe the bug
I'm trying to work out why my endpoint throughput is very slow. I wasn't sure if this is the best forum, but there doesn't appear to be a specific TorchServe category on https://discuss.pytorch.org/
I have a simple text classifier, and I've created a custom handler because the default wasn't suitable. I tested the handler by creating a harness based on https://github.com/frank-dong-ms/torchserve-performance/blob/main/test_models_windows.py, and I also added custom timer metrics to my preprocess/inference/postprocess methods.
The result is that most of the handle time is spent in inference, and my model performs as expected: it processes a batch of 1 text in about 40 ms and a batch of 128 in about 80 ms, so clearly, to get good throughput, I need larger batches.
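(In other words, roughly 1/0.04 ≈ 25 examples/s at batch size 1 versus 128/0.08 = 1600 examples/s at batch size 128, counting handler time only.)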
The throughput of a basic script passing batches of 128 to the model is about 2000 examples per second, but TorchServe only achieves 30-60 examples per second.
I'm fairly sure the bottleneck is not in the handler; the model log seems to imply the handler is not receiving requests quickly enough. I would hope that the frontend could accumulate a batch of 128 within maxBatchDelay=50 whilst the model is processing the previous batch, but in fact it only manages a handful. I've attached my model log below.
My first question is: what does the message Backend received inference at: 1669930796 mean? Specifically, is the number a timestamp, and if so, why is the same value repeated many times, given that the batches being passed to the handler are well below the batch size of 128 set in the model config?
Second, how do I stream data to the endpoint faster? Our use case is to make many requests in succession. I've tried client batching, and that does increase throughput slightly, but it's still extremely slow.
My test code is based on an example, and I've also tried curl with the -P option and the time command. Throughput is orders of magnitude slower than a simple script running inference in a loop.
import requests
import json
import time
from requests_futures.sessions import FuturesSession
from concurrent.futures import as_completed

api = "http://localhost:8080/predictions/text_classifier"
headers = {"Content-type": "application/json", "Accept": "text/plain"}

# texts is a list of short input strings (< 128 characters each), defined elsewhere
session = FuturesSession()
start_time = time.time()

# Fire off all requests without blocking on the responses
futures = []
for text in texts:
    futures.append(session.post(api, data=text))

# Collect responses as they complete
for future in as_completed(futures):
    result = future.result().content.decode("utf-8")

total_time = int((time.time() - start_time) * 1e3)
print("total time in ms:", total_time)

throughput = len(texts) / total_time * 1e3
print("throughput:", throughput)
I'm going to look at gRPC as that is probably a better match for our use case (I think), but I feel I'm doing something wrong, or there's an issue somewhere. In particular, the number of requests per second that the front end is receiving/handling appears to be far lower than I expected; the payload per request is a string of < 128 characters.
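For reference, a minimal sketch of a gRPC request to TorchServe might look like the code below. It assumes the Python stubs (inference_pb2, inference_pb2_grpc) have been generated from TorchServe's proto files and that the gRPC inference API is listening on the default port 7070; the port and field names are assumptions, not something verified in this issue.

import grpc
import inference_pb2        # generated from TorchServe's inference.proto (assumption)
import inference_pb2_grpc   # generated from TorchServe's inference.proto (assumption)

channel = grpc.insecure_channel("localhost:7070")
stub = inference_pb2_grpc.InferenceAPIsServiceStub(channel)

# Send a single prediction request for the text_classifier model
request = inference_pb2.PredictionsRequest(
    model_name="text_classifier",
    input={"data": "some example text".encode("utf-8")},
)
response = stub.Predictions(request)
print(response.prediction.decode("utf-8"))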
Error logs
model_log.log looks like
2022-12-02T08:39:55,921 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930795
2022-12-02T08:39:55,925 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 8 text
2022-12-02T08:39:56,015 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,016 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 4 text
2022-12-02T08:39:56,079 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,084 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 5 text
2022-12-02T08:39:56,147 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,149 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 7 text
2022-12-02T08:39:56,214 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,215 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 3 text
2022-12-02T08:39:56,279 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,281 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 6 text
2022-12-02T08:39:56,345 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,346 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 5 text
2022-12-02T08:39:56,407 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,409 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 6 text
2022-12-02T08:39:56,472 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,473 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 3 text
2022-12-02T08:39:56,534 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,536 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 8 text
2022-12-02T08:39:56,641 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,642 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 7 text
2022-12-02T08:39:56,707 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,709 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 3 text
2022-12-02T08:39:56,775 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,776 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 6 text
2022-12-02T08:39:56,842 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,844 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 5 text
2022-12-02T08:39:56,953 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930796
2022-12-02T08:39:56,955 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 8 text
2022-12-02T08:39:57,056 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930797
2022-12-02T08:39:57,057 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 6 text
2022-12-02T08:39:57,128 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Backend received inference at: 1669930797
2022-12-02T08:39:57,129 [INFO ] W-9000-text_classifier_1.0.0-stdout MODEL_LOG - Received batch of 3 text
Installation instructions
Ran install_dependencies.py and pip install torchserve, torch-model-archiver, torch-workflow-archiver
Model Packaging
N/A
config.properties
model_store=/home/dave/dev/text-classifier/model_server/model_store2
load_models=all
number_of_netty_threads=8
netty_client_threads=8
default_workers_per_model=8
job_queue_size=1000
models = {\
"text_classifier": {\
"1.0.0": {\
"defaultVersion": true,\
"marName": "text_classifier.mar",\
"minWorkers": 1,\
"maxWorkers": 8,\
"batchSize": 128,\
"maxBatchDelay": 50,\
"responseTimeout": 120\
}\
}\
}
Versions
torchserve==0.6.1 torch-model-archiver==0.6.1
Python version: 3.8 (64-bit runtime) Python executable: /home/dave/dev/text-classifier/.venv/bin/python
Versions of relevant python libraries: captum==0.5.0 future==0.18.2 numpy==1.23.5 nvgpu==0.9.0 psutil==5.9.4 pytest==7.2.0 requests==2.28.1 requests-futures==1.0.0 torch==1.13.0+cu117 torch-model-archiver==0.6.1 torch-workflow-archiver==0.2.5 torchaudio==0.13.0+cu117 torchdata==0.5.0 torchserve==0.6.1 torchtext==0.14.0 torchvision==0.14.0+cu117 wheel==0.38.4
Java Version:
OS: Ubuntu 20.04.5 LTS GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 Clang version: N/A CMake version: version 3.24.2
Is CUDA available: Yes CUDA runtime version: 11.3.109 GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090 GPU 1: NVIDIA GeForce RTX 3090 Nvidia driver version: 520.61.05 cuDNN version: Probably one of the following: /usr/lib/x86_64-linux-gnu/libcudnn.so.8.7.0 /usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.7.0 /usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.7.0 /usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.7.0 /usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.7.0 /usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.7.0 /usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.7.0
Repro instructions
N/A
Possible Solution
No response
Top GitHub Comments
@david-waterworth you can directly use the TorchServe benchmark tools: benchmarking-with-apache-bench or auto-benchmarking-with-apache-bench.
Thanks @lxning! I have managed to get better batching by adapting this example. Now that I have client code in Python that's fully async, I'm seeing the batching work as I expect.
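The comment doesn't include the code, but a fully async client along these lines (using aiohttp, which is an assumption) keeps many requests in flight so the frontend can fill batches within maxBatchDelay:

import asyncio
import time
import aiohttp

API = "http://localhost:8080/predictions/text_classifier"


async def predict(session, text):
    # POST one text and return the decoded prediction
    async with session.post(API, data=text) as response:
        return await response.text()


async def main(texts):
    start = time.time()
    async with aiohttp.ClientSession() as session:
        # All requests are in flight concurrently, so the server can batch them
        results = await asyncio.gather(*(predict(session, t) for t in texts))
    elapsed = time.time() - start
    print(f"throughput: {len(results) / elapsed:.1f} examples/s")


if __name__ == "__main__":
    texts = ["example text"] * 1000  # placeholder payloads
    asyncio.run(main(texts))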
I do wonder if instead of a fixed maxBatchDelay it might be better to have a maximum inter-request delay (or maybe as well as). That way if single requests are arriving they can be processed within the inter-request delay but if there is a burst then it batches.