Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

CuPy not waiting for TensorRT execution

See original GitHub issue

Description

I am working with TensorRT 7.2.1.6 and cupy-cuda111. I'd like to use CUDA streams to optimize the application. It seems that CuPy is not waiting for the TensorRT execution: the code below returns random results when the CuPy stream is created with stream = cp.cuda.Stream(non_blocking=True), while it works perfectly with non_blocking=False. Note that I am also reusing the same stream after one execution completes.
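
For context (background of mine, not part of the original report): a CUDA stream created with the cudaStreamNonBlocking flag does not implicitly synchronize with the legacy default stream, whereas a regular "blocking" stream does. CuPy exposes this through the non_blocking argument, as in this minimal sketch:

import cupy as cp

# A blocking stream implicitly synchronizes with the legacy default
# stream, so kernels that a library launches on stream 0 stay ordered
# with the work enqueued here.
blocking_stream = cp.cuda.Stream(non_blocking=False)

# A non-blocking stream is created with cudaStreamNonBlocking; it does
# NOT synchronize with the legacy default stream, so default-stream
# work can race against the work enqueued here.
non_blocking_stream = cp.cuda.Stream(non_blocking=True)

This distinction is the most likely reason the two settings behave differently: if anything inside the engine launches work on the default stream, only the blocking variant keeps it ordered with the CuPy operations.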

To Reproduce

import cupy as cp

# `stream`, `context`, `bindings`, `cuda_inputs`, `cuda_outputs` and
# `batch_input_image` come from the initialization shown further down.
# Select stream
stream.use()
# Copy cupy array to the buffer
input_images = cp.array(batch_input_image)
cp.copyto(cuda_inputs[0], input_images)
# Run inference.
context.execute_async(bindings=bindings, stream_handle=stream.ptr, batch_size=len(batch_input_image))
# Copy results from the buffer
output_images = cuda_outputs[0].copy()
# Split results into batch
list_output = cp.split(output_images, indices_or_sections=len(batch_input_image), axis=0)
# Squeeze output arrays to remove axis of length one
list_output = [cp.squeeze(array) for array in list_output]
# Synchronize the stream
stream.synchronize()
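
If the goal is to keep a non-blocking stream while still ordering it after work already enqueued on the default stream, one option (a sketch of mine, not a fix proposed in the thread) is an explicit CUDA event:

import cupy as cp

# Hypothetical ordering sketch: record the current point on the legacy
# default stream, then make the non-blocking stream wait for it.
event = cp.cuda.Event()
event.record(cp.cuda.Stream.null)
stream = cp.cuda.Stream(non_blocking=True)
stream.wait_event(event)

Note this only orders against the default stream; it cannot help if a plugin enqueues work on some other internal stream.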

Installation

Wheel (pip install cupy-***)

Environment

OS                           : Linux-5.9.0-yoctodev-standard-x86_64-with-Ubuntu-18.04-bionic
CuPy Version                 : 8.2.0
NumPy Version                : 1.19.5
SciPy Version                : 1.5.4
Cython Build Version         : 0.29.21
CUDA Root                    : /usr/local/cuda
CUDA Build Version           : 11010
CUDA Driver Version          : 11010
CUDA Runtime Version         : 11010
cuBLAS Version               : 11201
cuFFT Version                : 10300
cuRAND Version               : 10202
cuSOLVER Version             : (11, 0, 0)
cuSPARSE Version             : 11200
NVRTC Version                : (11, 1)
Thrust Version               : 100910
CUB Build Version            : 100910
cuDNN Build Version          : 8005
cuDNN Version                : 8004
NCCL Build Version           : 2708
NCCL Runtime Version         : 2708
cuTENSOR Version             : None
Device 0 Name                : GeForce RTX 2060
Device 0 Compute Capability  : 75

Additional Information

No response

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 17 (2 by maintainers)

Top GitHub Comments

3 reactions
pranavm-nvidia commented, Nov 16, 2021

@mfoglio It’s either a bug in the CalDetection/YoloLayerPlugin implementation or within TensorRT. Could you see if you’re able to reproduce this with a newer version of TensorRT (like 8.2)?

Also, I think it’s safe to rule out cupy, so maybe we should move the discussion to a TensorRT issue?
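
For completeness (my addition, not from the thread): the installed TensorRT version can be confirmed before and after such an upgrade with a one-liner:

import tensorrt as trt

# Print the installed TensorRT version to confirm which build the
# reproduction is actually running against.
print(trt.__version__)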

1 reaction
mfoglio commented, Nov 16, 2021

Ok, so I saved the preprocessed input into a binary file:

        with cp.cuda.Stream(non_blocking=False) as stream:
            # Copy cupy array to the buffer
            # input_images = cp.array(batch_input_image)
            input_images = cp.load("preprocessed_input.npy")
            cp.copyto(cuda_inputs[0], input_images)
            # Run inference.
            context.execute_async(bindings=bindings, stream_handle=stream.ptr, batch_size=len(batch_input_image))
            # Copy results from the buffer
            output_images = cuda_outputs[0].copy()
            # Split results into batch
            list_output = cp.split(output_images, indices_or_sections=len(batch_input_image), axis=0)
            # Squeeze output arrays to remove axis of length one
            list_output = [cp.squeeze(array) for array in list_output]
            # Synchronize the stream
            stream.synchronize()
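
For reference, the file loaded above could have been written with CuPy's NumPy-compatible save; a plausible one-liner, assuming input_images holds the preprocessed batch from the earlier snippet:

import cupy as cp

# Persist the preprocessed batch so later runs can replay the exact
# same input, independent of the preprocessing pipeline.
cp.save("preprocessed_input.npy", input_images)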

As for the initialization, it is as follows:

        # Deserialize the engine from file
        with open(self.engine_path, "rb") as f:
            engine = runtime.deserialize_cuda_engine(f.read())
        context = engine.create_execution_context()

        cuda_inputs = []
        cuda_outputs = []
        bindings = []

        for binding in engine:
            shape = [engine.max_batch_size, *engine.get_binding_shape(binding)]
            dtype = trt.nptype(engine.get_binding_dtype(binding))
            # Allocate host and device buffers
            cuda_mem = cp.zeros(shape=shape, dtype=dtype)  # cuda.mem_alloc(host_mem.nbytes)
            # Append the device buffer to device bindings.
            bindings.append(cuda_mem.data.ptr)
            # Append to the appropriate list.
            if engine.binding_is_input(binding):
                cuda_inputs.append(cuda_mem)
            else:
                cuda_outputs.append(cuda_mem)
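
One detail worth noting about this allocation scheme (my observation, not something raised in the thread): bindings holds only raw device pointers, so the CuPy arrays themselves must stay referenced for as long as the engine runs, which is exactly what the cuda_inputs and cuda_outputs lists accomplish; otherwise CuPy's memory pool could free and reuse the allocations mid-inference. A small sanity check along these lines:

# Hypothetical sanity check after allocation: one pointer per binding,
# none of them null.
assert len(bindings) == engine.num_bindings
assert all(ptr != 0 for ptr in bindings)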

Read more comments on GitHub >

Top Results From Across the Web

Cuda streams synchronization issue with cupy and tensorRT
I am working with TensorRT and cupy. The following code does not wait for the cuda calls to be executed if I set...

NVIDIA Deep Learning TensorRT Documentation
This NVIDIA TensorRT Developer Guide demonstrates how to use the C++ and Python APIs for implementing the most common deep learning layers.

CUDA Streams: Best Practices and Common Pitfalls
Multiple processes (e.g. MPI) on a single GPU could not operate concurrently ... Synchronous: enqueue work and wait for completion.

[D] Should We Be Using JAX in 2022? : r/MachineLearning
JAX can be incredibly fast and, while it's a no-brainer for certain ... Bonus ex-industry perspective: PyTorch (libtorch) + TensorRT + C++ ...

NVIDIA TensorRT - manuals.plus
TensorRT's network definition does not deep-copy parameter arrays (such as the ... To wait for completion of asynchronous execution, synchronize on the ...
