Concurrent requests to multiple models cause NaN values in output
Description
I use Triton to host two TRT models: an object detector and a feature extractor. When both models are asked to perform inference simultaneously (using the Python API `tritonclient.grpc.InferenceServerClient.infer(...)`), the feature extractor returns a numpy array containing NaN values.
This does not happen with multiple concurrent requests to any single model.
Triton Information
docker image: nvcr.io/nvidia/tritonserver:20.12-py3
server_version: 2.6.0
tritonclient==2.3.0
To Reproduce
- Load Triton with two models: `osnet_x0_25_dyn` and `yolov4_32`.
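Since the server runs with `--strict-model-config=false`, Triton autogenerates the model configurations. For reference, a minimal `config.pbtxt` for the feature extractor might look like the sketch below; the names and input dims are inferred from the client script, and the output dims are unknown here, so the real autogenerated config will differ:

```
name: "osnet_x0_25_dyn"
platform: "tensorrt_plan"
max_batch_size: 16
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 256, 128 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ -1 ]   # actual output shape not stated in this issue
  }
]
```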
I’m using docker-compose:
```yaml
version: '3.7'
services:
  triton:
    hostname: triton
    image: nvcr.io/nvidia/tritonserver:20.12-py3
    command: tritonserver --model-repository=/models --strict-model-config=false --grpc-infer-allocation-pool-size=4 --pinned-memory-pool-byte-size=64000000 --log-verbose 0
    ports:
      - '8001:8001'
    volumes:
      - '/home/julien/modelrepo/models:/models'
      - '/home/julien/modelrepo/plugins:/plugins'
    environment:
      - 'LD_PRELOAD=/plugins/liblayerplugin.so'
    ulimits:
      stack:
        soft: 67108864
        hard: 67108864
    shm_size: '1gb'
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: [ 1 ]
              capabilities: [ gpu ]
```
- Run both scripts below at the same time (in two different terminal windows).

`feature_extractor.py`:
```python
import numpy as np
import tritonclient.grpc as grpcclient
from multiprocessing import Pool


def send_feature_extractor_request(i):
    buffer = np.random.rand(16, 3, 256, 128).astype(np.float32)
    triton_client = grpcclient.InferenceServerClient(url='localhost:8001')
    res = []
    inputs = [grpcclient.InferInput('input', buffer.shape, 'FP32')]
    outputs = [grpcclient.InferRequestedOutput('output')]
    inputs[0].set_data_from_numpy(buffer)
    for _ in range(100):
        result = triton_client.infer(
            'osnet_x0_25_dyn', inputs=inputs, outputs=outputs)
        output = result.as_numpy('output')
        res.append(np.isnan(np.sum(output)))
    return {i: any(res)}


if __name__ == '__main__':
    N = 2
    with Pool(N) as p:
        print(p.map(send_feature_extractor_request, np.arange(0, N).tolist()))
```
and `detector.py`:
```python
import numpy as np
import tritonclient.grpc as grpcclient
from multiprocessing import Pool


def send_detector_request(i):
    buffer = np.random.rand(8, 3, 512, 512).astype(np.float32)
    triton_client = grpcclient.InferenceServerClient(url='localhost:8001')
    res = []
    inputs = [grpcclient.InferInput('data', buffer.shape, 'FP32')]
    outputs = [grpcclient.InferRequestedOutput('prob')]
    inputs[0].set_data_from_numpy(buffer)
    for _ in range(100):
        result = triton_client.infer(
            'yolov4_32', inputs=inputs, outputs=outputs)
        output = result.as_numpy('prob')
        res.append(np.isnan(np.sum(output)))
    return {i: any(res)}


if __name__ == '__main__':
    N = 2
    with Pool(N) as p:
        print(p.map(send_detector_request, np.arange(0, N).tolist()))
```
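As an aside, the `np.isnan(np.sum(output))` check in both scripts only reports whether any NaN is present in the whole response. When narrowing a failure down, it can help to see which batch entries are affected; here is a minimal sketch of such a check (the `nan_rows` helper and the flattening to `(batch, -1)` are my additions, not part of the original repro):

```python
import numpy as np


def nan_rows(output):
    """Return the indices of batch rows that contain at least one NaN."""
    # Flatten everything except the batch axis, then flag rows with any NaN.
    flat = output.reshape(output.shape[0], -1)
    return np.flatnonzero(np.isnan(flat).any(axis=1)).tolist()


if __name__ == '__main__':
    demo = np.zeros((4, 8), dtype=np.float32)
    demo[2, 3] = np.nan    # corrupt one entry in row 2
    print(nan_rows(demo))  # [2]
```

Inside the loops above, one could call `nan_rows(output)` instead of `np.isnan(np.sum(output))` to record whether the NaNs are confined to particular batch slots or spread across the whole response.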
You will see that the `feature_extractor.py` script outputs `[{0: True}, {1: True}]`, meaning both subprocesses encountered NaN values in their responses.
Expected behavior
I expect both models to return correct values, even when multiple clients send inference requests simultaneously. There should never be a NaN in `output`, so both scripts should print `[{0: False}, {1: False}]`.
Issue Analytics
- State:
- Created 2 years ago
- Reactions: 1
- Comments: 16 (9 by maintainers)
Top GitHub Comments
Any updates on this? With the other things we've tried, the problem still occurs, although it happens less often with fewer sequential requests (that still overlap).
Related to #2339?
Thanks for sharing the models. I have filed a bug against the dev team to further investigate this. They may follow up with you for additional information.