Concurrent requests to multiple models cause NaN values in output
Description
I use Triton to host two TRT models: an object detector and a feature extractor. When both models are asked to perform inference simultaneously (using the Python API `tritonclient.grpc.InferenceServerClient.infer(...)`), the feature extractor returns a numpy array containing NaN values.
This does not happen with multiple concurrent requests to any single model.
Triton Information
docker image: nvcr.io/nvidia/tritonserver:20.12-py3
server_version: 2.6.0
tritonclient==2.3.0
To Reproduce
- Load Triton with two models: `osnet_x0_25_dyn` and `yolov4_32`.
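Since the server runs with `--strict-model-config=false`, Triton autogenerates the model configurations. For reference, a minimal `config.pbtxt` for the feature extractor might look like the sketch below; the names and input dims are inferred from the client script, and the output dims are unknown here, so the real autogenerated config will differ:

```
name: "osnet_x0_25_dyn"
platform: "tensorrt_plan"
max_batch_size: 16
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 256, 128 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ -1 ]   # actual output shape not stated in this issue
  }
]
```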
I’m using docker-compose:
```yaml
version: '3.7'
services:
  triton:
    hostname: triton
    image: nvcr.io/nvidia/tritonserver:20.12-py3
    command: tritonserver --model-repository=/models --strict-model-config=false --grpc-infer-allocation-pool-size=4 --pinned-memory-pool-byte-size=64000000 --log-verbose 0
    ports:
      - '8001:8001'
    volumes:
      - '/home/julien/modelrepo/models:/models'
      - '/home/julien/modelrepo/plugins:/plugins'
    environment:
      - 'LD_PRELOAD=/plugins/liblayerplugin.so'
    ulimits:
      stack:
        soft: 67108864
        hard: 67108864
    shm_size: '1gb'
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: [ 1 ]
              capabilities: [ gpu ]
```
- Run both scripts below at the same time (in two different terminal windows).

`feature_extractor.py`:
```python
import numpy as np
import tritonclient.grpc as grpcclient
from multiprocessing import Pool


def send_feature_extractor_request(i):
    buffer = np.random.rand(16, 3, 256, 128).astype(np.float32)
    triton_client = grpcclient.InferenceServerClient(url='localhost:8001')
    res = []
    inputs = [grpcclient.InferInput('input', buffer.shape, 'FP32')]
    outputs = [grpcclient.InferRequestedOutput('output')]
    inputs[0].set_data_from_numpy(buffer)
    for _ in range(100):
        result = triton_client.infer(
            'osnet_x0_25_dyn', inputs=inputs, outputs=outputs)
        output = result.as_numpy('output')
        res.append(np.isnan(np.sum(output)))
    return {i: any(res)}


if __name__ == '__main__':
    N = 2
    with Pool(N) as p:
        print(p.map(send_feature_extractor_request, np.arange(0, N).tolist()))
```
and `detector.py`:
```python
import numpy as np
import tritonclient.grpc as grpcclient
from multiprocessing import Pool


def send_detector_request(i):
    buffer = np.random.rand(8, 3, 512, 512).astype(np.float32)
    triton_client = grpcclient.InferenceServerClient(url='localhost:8001')
    res = []
    inputs = [grpcclient.InferInput('data', buffer.shape, 'FP32')]
    outputs = [grpcclient.InferRequestedOutput('prob')]
    inputs[0].set_data_from_numpy(buffer)
    for _ in range(100):
        result = triton_client.infer(
            'yolov4_32', inputs=inputs, outputs=outputs)
        output = result.as_numpy('prob')
        res.append(np.isnan(np.sum(output)))
    return {i: any(res)}


if __name__ == '__main__':
    N = 2
    with Pool(N) as p:
        print(p.map(send_detector_request, np.arange(0, N).tolist()))
```
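As an aside, the `np.isnan(np.sum(output))` check in both scripts only reports whether any NaN is present in the whole response. When narrowing a failure down, it can help to see which batch entries are affected; here is a minimal sketch of such a check (the `nan_rows` helper and the flattening to `(batch, -1)` are my additions, not part of the original repro):

```python
import numpy as np


def nan_rows(output):
    """Return the indices of batch rows that contain at least one NaN."""
    # Flatten everything except the batch axis, then flag rows with any NaN.
    flat = output.reshape(output.shape[0], -1)
    return np.flatnonzero(np.isnan(flat).any(axis=1)).tolist()


if __name__ == '__main__':
    demo = np.zeros((4, 8), dtype=np.float32)
    demo[2, 3] = np.nan    # corrupt one entry in row 2
    print(nan_rows(demo))  # [2]
```

Inside the loops above, one could call `nan_rows(output)` instead of `np.isnan(np.sum(output))` to record whether the NaNs are confined to particular batch slots or spread across the whole response.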
You will see that the `feature_extractor.py` script outputs `[{0: True}, {1: True}]`, meaning both subprocesses encountered NaN values in their responses.
Expected behavior
I expect both models to return correct values, even when multiple clients send inference requests simultaneously. There should never be a NaN in `output`, so both scripts should print `[{0: False}, {1: False}]`.
Issue Analytics
- State:
- Created 2 years ago
- Reactions: 1
- Comments: 16 (9 by maintainers)
Top GitHub Comments
Any updates on this? With the other things we've tried, the problem still occurs, although it happens less often with fewer sequential requests (that still overlap).
Related to #2339?
Thanks for sharing the models. I have filed a bug against the dev team to further investigate this. They may follow up with you for additional information.