Nonlinear increase of throughput as the number of CPU instances increases
Description

Increasing the number of CPU instances of a model increases the compute infer time. Throughput also does not increase linearly and sometimes decreases.
Triton Information
What version of Triton are you using?
r21.12
Are you using the Triton container or did you build it yourself?
I used the Triton container and the Model Analyzer container.
To Reproduce
First, build the Model Analyzer image:

```
git clone https://github.com/triton-inference-server/model_analyzer -b r21.12
cd model_analyzer
docker build --pull -t model-analyzer .
```
Create the add_sub model:

```
git clone https://github.com/triton-inference-server/python_backend -b r21.12
cd python_backend
mkdir data
mkdir -p models/add_sub/1/
cp examples/add_sub/model.py models/add_sub/1/model.py
cp examples/add_sub/config.pbtxt models/add_sub/config.pbtxt
```
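After these steps, the python_backend working directory should look like this:

```
python_backend/
├── data/                  # will hold config.yaml and analyze.yaml
└── models/
    └── add_sub/
        ├── 1/
        │   └── model.py
        └── config.pbtxt
```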
Create config.yaml inside the data directory as follows:

```yaml
model_repository: /models
profile_models:
  add_sub:
    parameters:
      concurrency:
        start: 32
        stop: 32
    model_config_parameters:
      instance_group:
        - kind: KIND_CPU
          count: [1, 2, 3, 4]
override_output_model_repository: True
client_protocol: grpc
```
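With this sweep, Model Analyzer generates one model config variant per count value (the add_sub_i0 through add_sub_i3 configs in the results below). Each variant overrides the instance group in the model config; roughly, the 4-instance variant (add_sub_i3) would contain:

```
instance_group [
  {
    kind: KIND_CPU
    count: 4
  }
]
```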
Create analyze.yaml inside the data directory as follows:

```yaml
analysis_models:
  add_sub:
    objectives:
      - perf_throughput
inference_output_fields: [
  'model_name', 'concurrency', 'model_config_path',
  'instance_group', 'perf_throughput',
  'perf_latency_p99', 'perf_client_response_wait',
  'perf_server_queue', 'perf_server_compute_infer'
]
```
Run the Model Analyzer container from inside the python_backend directory:

```
docker run -it --rm --shm-size=2g --gpus all \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v ${PWD}/models:/models \
  -v ${PWD}/data/:/data \
  --net=host --name model-analyzer \
  model-analyzer /bin/bash
```

Then run the following commands inside the container:

```
model-analyzer profile --config-file /data/config.yaml
model-analyzer analyze --config-file /data/analyze.yaml
```
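As a sanity check (not part of the original report), the same concurrency-32 measurement can be reproduced directly with perf_analyzer from the Triton SDK container against a running server; the command below assumes the default gRPC port:

```
perf_analyzer -m add_sub -i grpc -u localhost:8001 --concurrency-range 32:32
```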
Here are the measurement results:

Models (Inference):

| Model | Concurrency | Model Config Path | Instance Group | Throughput (infer/sec) | p99 Latency (ms) | Response Wait Time (ms) | Server Queue Time (ms) | Server Compute Infer Time (ms) |
|---|---|---|---|---|---|---|---|---|
| add_sub | 32 | add_sub_i3 | 4/CPU | 11325.0 | 5.3 | 2.8 | 1.9 | 0.3 |
| add_sub | 32 | add_sub_i2 | 3/CPU | 10517.0 | 4.0 | 3.0 | 2.4 | 0.2 |
| add_sub | 32 | add_sub_i1 | 2/CPU | 8504.0 | 5.1 | 3.7 | 3.3 | 0.2 |
| add_sub | 32 | add_sub_i0 | 1/CPU | 5049.0 | 7.2 | 6.3 | 6.0 | 0.1 |
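To make the sub-linear scaling concrete, here is a quick back-of-envelope computation over the reported throughputs (my own sketch; the numbers are taken from the table above):

```python
# Measured throughput (infer/sec) per CPU instance count, from the table above.
measured = {1: 5049.0, 2: 8504.0, 3: 10517.0, 4: 11325.0}

base = measured[1]
for n, tput in measured.items():
    speedup = tput / base       # actual speedup over 1 instance
    efficiency = speedup / n    # fraction of ideal linear scaling
    print(f"{n} instance(s): speedup {speedup:.2f}x, scaling efficiency {efficiency:.0%}")

# Output:
# 1 instance(s): speedup 1.00x, scaling efficiency 100%
# 2 instance(s): speedup 1.68x, scaling efficiency 84%
# 3 instance(s): speedup 2.08x, scaling efficiency 69%
# 4 instance(s): speedup 2.24x, scaling efficiency 56%
```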
Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble, include the model configuration file for that as well).
I used the add_sub example in the python_backend repo to test this behavior.
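For reference, the add_sub example is a Python-backend model that returns the element-wise sum and difference of its two inputs. Its config.pbtxt in the r21.12 branch looks roughly like this (reproduced from memory; check the repo for the exact file):

```
name: "add_sub"
backend: "python"

input [
  { name: "INPUT0", data_type: TYPE_FP32, dims: [ 4 ] },
  { name: "INPUT1", data_type: TYPE_FP32, dims: [ 4 ] }
]
output [
  { name: "OUTPUT0", data_type: TYPE_FP32, dims: [ 4 ] },
  { name: "OUTPUT1", data_type: TYPE_FP32, dims: [ 4 ] }
]

instance_group [{ kind: KIND_CPU }]
```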
Expected behavior

I expect a near-linear increase in throughput and an almost constant compute infer time as the number of CPU instances of the model increases.
Top GitHub Comments
The idle CPU usage is related to the issue that you've linked to, and it has been fixed. I think the CPU usage in the non-idle case is expected: you are measuring performance under load, which can lead to significant CPU usage.
Also, the throughput increase can be non-linear because of the shared resources in use. For example, having more instances may lead to more cache misses and thus smaller performance gains.
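As an illustration of that point (my own back-of-envelope sketch, not from the maintainers): if the shared-resource contention is treated as a fixed serialized fraction of the work, fitting Amdahl's law to the measured 4-instance speedup suggests roughly a quarter of the work is effectively serialized:

```python
# Amdahl's law: speedup(n) = 1 / (s + (1 - s) / n), where s is the serialized fraction.
# Solve for s from the measured 4-instance speedup of ~2.24x (11325 / 5049).
n = 4
speedup = 11325.0 / 5049.0  # ~2.24

s = (1.0 / speedup - 1.0 / n) / (1.0 - 1.0 / n)
print(f"implied serialized fraction: {s:.0%}")  # ~26%

# With s ~= 0.26, the predicted speedups for n = 2 and n = 3 are ~1.59x and ~1.97x,
# in the same ballpark as the measured 1.68x and 2.08x.
```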
Great. Thanks for letting us know. I’ll close this ticket as the original problem is resolved.