Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

CuPy not waiting for TensorRT execution

See original GitHub issue

Description

I am working with TensorRT 7.2.1.6 and cupy-cuda111. I'd like to use CUDA streams to optimize the application. It seems that CuPy is not waiting for the TensorRT execution: the code below returns random results when the CuPy stream is created with stream = cp.cuda.Stream(non_blocking=True), while it works perfectly with non_blocking=False. Note that I am also reusing the same stream after one execution completes.
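
For context (background of mine, not part of the original report): a CUDA stream created with the cudaStreamNonBlocking flag does not implicitly synchronize with the legacy default stream, whereas a regular "blocking" stream does. CuPy exposes this through the non_blocking argument, as in this minimal sketch:

import cupy as cp

# A blocking stream implicitly synchronizes with the legacy default
# stream, so kernels that a library launches on stream 0 stay ordered
# with the work enqueued here.
blocking_stream = cp.cuda.Stream(non_blocking=False)

# A non-blocking stream is created with cudaStreamNonBlocking; it does
# NOT synchronize with the legacy default stream, so default-stream
# work can race against the work enqueued here.
non_blocking_stream = cp.cuda.Stream(non_blocking=True)

This distinction is the most likely reason the two settings behave differently: if anything inside the engine launches work on the default stream, only the blocking variant keeps it ordered with the CuPy operations.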

To Reproduce

import cupy as cp

# `stream`, `context`, `bindings`, `cuda_inputs`, `cuda_outputs` and
# `batch_input_image` come from the initialization shown further down.
# Select stream
stream.use()
# Copy cupy array to the buffer
input_images = cp.array(batch_input_image)
cp.copyto(cuda_inputs[0], input_images)
# Run inference.
context.execute_async(bindings=bindings, stream_handle=stream.ptr, batch_size=len(batch_input_image))
# Copy results from the buffer
output_images = cuda_outputs[0].copy()
# Split results into batch
list_output = cp.split(output_images, indices_or_sections=len(batch_input_image), axis=0)
# Squeeze output arrays to remove axis of length one
list_output = [cp.squeeze(array) for array in list_output]
# Synchronize the stream
stream.synchronize()
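
If the goal is to keep a non-blocking stream while still ordering it after work already enqueued on the default stream, one option (a sketch of mine, not a fix proposed in the thread) is an explicit CUDA event:

import cupy as cp

# Hypothetical ordering sketch: record the current point on the legacy
# default stream, then make the non-blocking stream wait for it.
event = cp.cuda.Event()
event.record(cp.cuda.Stream.null)
stream = cp.cuda.Stream(non_blocking=True)
stream.wait_event(event)

Note this only orders against the default stream; it cannot help if a plugin enqueues work on some other internal stream.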

Installation

Wheel (pip install cupy-***)

Environment

OS                           : Linux-5.9.0-yoctodev-standard-x86_64-with-Ubuntu-18.04-bionic
CuPy Version                 : 8.2.0
NumPy Version                : 1.19.5
SciPy Version                : 1.5.4
Cython Build Version         : 0.29.21
CUDA Root                    : /usr/local/cuda
CUDA Build Version           : 11010
CUDA Driver Version          : 11010
CUDA Runtime Version         : 11010
cuBLAS Version               : 11201
cuFFT Version                : 10300
cuRAND Version               : 10202
cuSOLVER Version             : (11, 0, 0)
cuSPARSE Version             : 11200
NVRTC Version                : (11, 1)
Thrust Version               : 100910
CUB Build Version            : 100910
cuDNN Build Version          : 8005
cuDNN Version                : 8004
NCCL Build Version           : 2708
NCCL Runtime Version         : 2708
cuTENSOR Version             : None
Device 0 Name                : GeForce RTX 2060
Device 0 Compute Capability  : 75

Additional Information

No response

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 17 (2 by maintainers)

Top GitHub Comments

3 reactions
pranavm-nvidia commented, Nov 16, 2021

@mfoglio It’s either a bug in the CalDetection/YoloLayerPlugin implementation or within TensorRT. Could you see if you’re able to reproduce this with a newer version of TensorRT (like 8.2)?

Also, I think it’s safe to rule out cupy, so maybe we should move the discussion to a TensorRT issue?
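
For completeness (my addition, not from the thread): the installed TensorRT version can be confirmed before and after such an upgrade with a one-liner:

import tensorrt as trt

# Print the installed TensorRT version to confirm which build the
# reproduction is actually running against.
print(trt.__version__)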

1 reaction
mfoglio commented, Nov 16, 2021

Ok, so I saved the preprocessed input into a binary file:

        with cp.cuda.Stream(non_blocking=False) as stream:
            # Copy cupy array to the buffer
            # input_images = cp.array(batch_input_image)
            input_images = cp.load("preprocessed_input.npy")
            cp.copyto(cuda_inputs[0], input_images)
            # Run inference.
            context.execute_async(bindings=bindings, stream_handle=stream.ptr, batch_size=len(batch_input_image))
            # Copy results from the buffer
            output_images = cuda_outputs[0].copy()
            # Split results into batch
            list_output = cp.split(output_images, indices_or_sections=len(batch_input_image), axis=0)
            # Squeeze output arrays to remove axis of length one
            list_output = [cp.squeeze(array) for array in list_output]
            # Synchronize the stream
            stream.synchronize()
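
For reference, the file loaded above could have been written with CuPy's NumPy-compatible save; a plausible one-liner, assuming input_images holds the preprocessed batch from the earlier snippet:

import cupy as cp

# Persist the preprocessed batch so later runs can replay the exact
# same input, independent of the preprocessing pipeline.
cp.save("preprocessed_input.npy", input_images)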

As for the initialization, it is as follows:

        # Deserialize the engine from file
        with open(self.engine_path, "rb") as f:
            engine = runtime.deserialize_cuda_engine(f.read())
        context = engine.create_execution_context()

        cuda_inputs = []
        cuda_outputs = []
        bindings = []

        for binding in engine:
            shape = [engine.max_batch_size, *engine.get_binding_shape(binding)]
            dtype = trt.nptype(engine.get_binding_dtype(binding))
            # Allocate host and device buffers
            cuda_mem = cp.zeros(shape=shape, dtype=dtype)  # cuda.mem_alloc(host_mem.nbytes)
            # Append the device buffer to device bindings.
            bindings.append(cuda_mem.data.ptr)
            # Append to the appropriate list.
            if engine.binding_is_input(binding):
                cuda_inputs.append(cuda_mem)
            else:
                cuda_outputs.append(cuda_mem)
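
One detail worth noting about this allocation scheme (my observation, not something raised in the thread): bindings holds only raw device pointers, so the CuPy arrays themselves must stay referenced for as long as the engine runs, which is exactly what the cuda_inputs and cuda_outputs lists accomplish; otherwise CuPy's memory pool could free and reuse the allocations mid-inference. A small sanity check along these lines:

# Hypothetical sanity check after allocation: one pointer per binding,
# none of them null.
assert len(bindings) == engine.num_bindings
assert all(ptr != 0 for ptr in bindings)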

Read more comments on GitHub >

Top Results From Across the Web

Cuda streams synchronization issue with cupy and tensorRT
I am working with TensorRT and cupy. The following code does not wait for the cuda calls to be executed if I set...

NVIDIA Deep Learning TensorRT Documentation
This NVIDIA TensorRT Developer Guide demonstrates how to use the C++ and Python APIs for implementing the most common deep learning layers.

CUDA Streams: Best Practices and Common Pitfalls
Multiple processes (e.g. MPI) on a single GPU could not operate concurrently ... Synchronous: enqueue work and wait for completion.

[D] Should We Be Using JAX in 2022? : r/MachineLearning
JAX can be incredibly fast and, while it's a no-brainer for certain ... Bonus ex-industry perspective: PyTorch (libtorch) + TensorRT + C++ ...

NVIDIA TensorRT - manuals.plus
TensorRT's network definition does not deep-copy parameter arrays (such as the ... To wait for completion of asynchronous execution, synchronize on the ...
