Nonlinear increase of throughput as the number of CPU instances increases
Description

Increasing the number of CPU instances of a model increases the compute infer time. Throughput also does not increase linearly and sometimes decreases.
Triton Information
What version of Triton are you using?
r21.12
Are you using the Triton container or did you build it yourself?
I used the Triton container and the Model Analyzer container.
To Reproduce
First, build the Model Analyzer image:

```
git clone https://github.com/triton-inference-server/model_analyzer -b r21.12
cd model_analyzer
docker build --pull -t model-analyzer .
```
Create the add_sub model:

```
git clone https://github.com/triton-inference-server/python_backend -b r21.12
cd python_backend
mkdir data
mkdir -p models/add_sub/1/
cp examples/add_sub/model.py models/add_sub/1/model.py
cp examples/add_sub/config.pbtxt models/add_sub/config.pbtxt
```
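After these steps, the python_backend working directory should look like this:

```
python_backend/
├── data/                  # will hold config.yaml and analyze.yaml
└── models/
    └── add_sub/
        ├── 1/
        │   └── model.py
        └── config.pbtxt
```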
Create config.yaml inside the data directory as follows:

```yaml
model_repository: /models
profile_models:
  add_sub:
    parameters:
      concurrency:
        start: 32
        stop: 32
    model_config_parameters:
      instance_group:
        - kind: KIND_CPU
          count: [1, 2, 3, 4]
override_output_model_repository: True
client_protocol: grpc
```
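With this sweep, Model Analyzer generates one model config variant per count value (the add_sub_i0 through add_sub_i3 configs in the results below). Each variant overrides the instance group in the model config; roughly, the 4-instance variant (add_sub_i3) would contain:

```
instance_group [
  {
    kind: KIND_CPU
    count: 4
  }
]
```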
Create analyze.yaml inside the data directory as follows:

```yaml
analysis_models:
  add_sub:
    objectives:
      - perf_throughput
inference_output_fields: [
  'model_name', 'concurrency', 'model_config_path',
  'instance_group', 'perf_throughput',
  'perf_latency_p99', 'perf_client_response_wait',
  'perf_server_queue', 'perf_server_compute_infer'
]
```
Run the Model Analyzer container from inside the python_backend directory:

```
docker run -it --rm --shm-size=2g --gpus all \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v ${PWD}/models:/models \
  -v ${PWD}/data/:/data \
  --net=host --name model-analyzer \
  model-analyzer /bin/bash
```

Then run the following commands inside the container:

```
model-analyzer profile --config-file /data/config.yaml
model-analyzer analyze --config-file /data/analyze.yaml
```
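As a sanity check (not part of the original report), the same concurrency-32 measurement can be reproduced directly with perf_analyzer from the Triton SDK container against a running server; the command below assumes the default gRPC port:

```
perf_analyzer -m add_sub -i grpc -u localhost:8001 --concurrency-range 32:32
```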
Here are the measurement results:

Models (Inference):

| Model | Concurrency | Model Config Path | Instance Group | Throughput (infer/sec) | p99 Latency (ms) | Response Wait Time (ms) | Server Queue Time (ms) | Server Compute Infer Time (ms) |
|---|---|---|---|---|---|---|---|---|
| add_sub | 32 | add_sub_i3 | 4/CPU | 11325.0 | 5.3 | 2.8 | 1.9 | 0.3 |
| add_sub | 32 | add_sub_i2 | 3/CPU | 10517.0 | 4.0 | 3.0 | 2.4 | 0.2 |
| add_sub | 32 | add_sub_i1 | 2/CPU | 8504.0 | 5.1 | 3.7 | 3.3 | 0.2 |
| add_sub | 32 | add_sub_i0 | 1/CPU | 5049.0 | 7.2 | 6.3 | 6.0 | 0.1 |
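To make the sub-linear scaling concrete, here is a quick back-of-envelope computation over the reported throughputs (my own sketch; the numbers are taken from the table above):

```python
# Measured throughput (infer/sec) per CPU instance count, from the table above.
measured = {1: 5049.0, 2: 8504.0, 3: 10517.0, 4: 11325.0}

base = measured[1]
for n, tput in measured.items():
    speedup = tput / base       # actual speedup over 1 instance
    efficiency = speedup / n    # fraction of ideal linear scaling
    print(f"{n} instance(s): speedup {speedup:.2f}x, scaling efficiency {efficiency:.0%}")

# Output:
# 1 instance(s): speedup 1.00x, scaling efficiency 100%
# 2 instance(s): speedup 1.68x, scaling efficiency 84%
# 3 instance(s): speedup 2.08x, scaling efficiency 69%
# 4 instance(s): speedup 2.24x, scaling efficiency 56%
```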
Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble, include the model configuration file for that as well).
I used the add_sub example in the python_backend repo to test this behavior.
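For reference, the add_sub example is a Python-backend model that returns the element-wise sum and difference of its two inputs. Its config.pbtxt in the r21.12 branch looks roughly like this (reproduced from memory; check the repo for the exact file):

```
name: "add_sub"
backend: "python"

input [
  { name: "INPUT0", data_type: TYPE_FP32, dims: [ 4 ] },
  { name: "INPUT1", data_type: TYPE_FP32, dims: [ 4 ] }
]
output [
  { name: "OUTPUT0", data_type: TYPE_FP32, dims: [ 4 ] },
  { name: "OUTPUT1", data_type: TYPE_FP32, dims: [ 4 ] }
]

instance_group [{ kind: KIND_CPU }]
```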
Expected behavior

I expect a near-linear increase in throughput and an almost constant compute infer time as the number of CPU instances of the model increases.
Top GitHub Comments
The idle CPU usage is related to the issue that you've linked to, and it has been fixed. I think the CPU usage in the non-idle case is expected: you are measuring performance under load, which can lead to significant CPU usage.
Also, the throughput increase can be non-linear because of the shared resources in use. For example, having more instances may lead to more cache misses and thus smaller performance gains.
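As an illustration of that point (my own back-of-envelope sketch, not from the maintainers): if the shared-resource contention is treated as a fixed serialized fraction of the work, fitting Amdahl's law to the measured 4-instance speedup suggests roughly a quarter of the work is effectively serialized:

```python
# Amdahl's law: speedup(n) = 1 / (s + (1 - s) / n), where s is the serialized fraction.
# Solve for s from the measured 4-instance speedup of ~2.24x (11325 / 5049).
n = 4
speedup = 11325.0 / 5049.0  # ~2.24

s = (1.0 / speedup - 1.0 / n) / (1.0 - 1.0 / n)
print(f"implied serialized fraction: {s:.0%}")  # ~26%

# With s ~= 0.26, the predicted speedups for n = 2 and n = 3 are ~1.59x and ~1.97x,
# in the same ballpark as the measured 1.68x and 2.08x.
```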
Great. Thanks for letting us know. I’ll close this ticket as the original problem is resolved.