Cupy not waiting for TensorRT execution
Description
I am working with TensorRT 7.2.1.6 and cupy-cuda111. I’d like to use CUDA streams to optimize the application. It seems that CuPy is not waiting for the TensorRT execution: the following code returns random results when the CuPy stream is created with stream = cp.cuda.Stream(non_blocking=True), while it works correctly with non_blocking=False.
Note that I am also reusing the same stream after one execution completes.
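For context, here is a minimal CuPy-only sketch (an illustration, not code from the issue) of why the non_blocking flag matters: a non-blocking stream is not implicitly ordered against the legacy default (null) stream, so work issued on the null stream can read a buffer before a kernel queued on the non-blocking stream has written it.

import cupy as cp

stream = cp.cuda.Stream(non_blocking=True)
with stream:
    x = cp.ones((1024,), dtype=cp.float32) * 3.0  # queued on `stream`
# Without this synchronize, a null-stream operation such as cp.asnumpy(x)
# is not ordered after the kernel above (non-blocking streams skip the
# implicit null-stream synchronization), so it could read stale data.
stream.synchronize()
host_x = cp.asnumpy(x)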
To Reproduce
import cupy as cp

# `stream`, `context`, `bindings`, `cuda_inputs`, `cuda_outputs`, and
# `batch_input_image` are created during engine initialization (not shown).

# Make this stream the current CuPy stream
stream.use()
# Copy the input batch into the device input buffer
input_images = cp.array(batch_input_image)
cp.copyto(cuda_inputs[0], input_images)
# Run inference asynchronously on the same stream
context.execute_async(bindings=bindings, stream_handle=stream.ptr,
                      batch_size=len(batch_input_image))
# Copy results out of the device output buffer
output_images = cuda_outputs[0].copy()
# Split results into one array per batch element
list_output = cp.split(output_images, indices_or_sections=len(batch_input_image), axis=0)
# Squeeze output arrays to remove the axis of length one
list_output = [cp.squeeze(array) for array in list_output]
# Wait for all work queued on the stream to finish
stream.synchronize()
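A hedged workaround sketch (the names stream, context, bindings, cuda_outputs, and batch_input_image are assumed to match the snippet above): record a CuPy event on the stream right after execute_async and wait on it before reading the output buffers, which makes the host-side ordering explicit regardless of the stream’s blocking flag.

import cupy as cp

done = cp.cuda.Event()
context.execute_async(bindings=bindings, stream_handle=stream.ptr,
                      batch_size=len(batch_input_image))
done.record(stream)   # mark the point on the stream where inference ends
done.synchronize()    # block the host until that point is reached
output_images = cuda_outputs[0].copy()  # outputs are now safe to read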
Installation
Wheel (pip install cupy-***)
Environment
OS : Linux-5.9.0-yoctodev-standard-x86_64-with-Ubuntu-18.04-bionic
CuPy Version : 8.2.0
NumPy Version : 1.19.5
SciPy Version : 1.5.4
Cython Build Version : 0.29.21
CUDA Root : /usr/local/cuda
CUDA Build Version : 11010
CUDA Driver Version : 11010
CUDA Runtime Version : 11010
cuBLAS Version : 11201
cuFFT Version : 10300
cuRAND Version : 10202
cuSOLVER Version : (11, 0, 0)
cuSPARSE Version : 11200
NVRTC Version : (11, 1)
Thrust Version : 100910
CUB Build Version : 100910
cuDNN Build Version : 8005
cuDNN Version : 8004
NCCL Build Version : 2708
NCCL Runtime Version : 2708
cuTENSOR Version : None
Device 0 Name : GeForce RTX 2060
Device 0 Compute Capability : 75
Additional Information
No response
Issue Analytics
- State:
- Created 2 years ago
- Comments: 17 (2 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@mfoglio It’s either a bug in the CalDetection/YoloLayerPlugin implementation or within TensorRT. Could you see if you’re able to reproduce this with a newer version of TensorRT (like 8.2)? Also, I think it’s safe to rule out cupy, so maybe we should move the discussion to a TensorRT issue?
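To support ruling out CuPy, a minimal CuPy-only ordering check (a sketch, independent of TensorRT and any plugin) can confirm that operations queued on a single non-blocking stream still execute in order:

import cupy as cp
import numpy as np

stream = cp.cuda.Stream(non_blocking=True)
with stream:
    a = cp.arange(1 << 20, dtype=cp.float32)
    b = a * 2.0             # queued after `a` on the same stream
stream.synchronize()        # wait for everything queued on the stream
np.testing.assert_array_equal(cp.asnumpy(b),
                              np.arange(1 << 20, dtype=np.float32) * 2.0)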
Ok, so I saved the preprocessed input into a binary file:
As for the initialization, it is the following: