cudashm.get_contents_as_numpy always using GPU 0
Description
While running my code I checked nvidia-smi and saw unexpected allocations on GPU 0. I have narrowed it down to the return statement that calls cudashm.get_contents_as_numpy(). Even when I pass a non-zero device ID to cudashm.create_shared_memory_region() and triton_client.register_cuda_shared_memory(), the memory still ends up on GPU 0.
Triton Information
I am using the 20.06 Triton container (v2).
To Reproduce
Use a system with more than one GPU. Run a Triton client with gRPC and CUDA shared memory, setting the output to be allocated on a GPU other than ID 0. Watch nvidia-smi and see the output memory get allocated on GPU 0 anyway. A minimal client sketch is shown below.
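A minimal repro sketch, assuming a Triton server reachable at localhost:8001 and the current tritonclient package layout (the 20.06 wheels shipped the same calls under slightly different module names). The region name, byte size, and output shape are placeholders.

```python
import numpy as np
import tritonclient.grpc as grpcclient
import tritonclient.utils.cuda_shared_memory as cudashm

DEVICE_ID = 1                  # any GPU other than 0
BYTE_SIZE = 4 * 1024 * 1024    # placeholder size for the output region

client = grpcclient.InferenceServerClient("localhost:8001")
client.unregister_cuda_shared_memory()

# Create the region on GPU 1 and tell the server about it.
shm_handle = cudashm.create_shared_memory_region("output_shm", BYTE_SIZE, DEVICE_ID)
client.register_cuda_shared_memory(
    "output_shm", cudashm.get_raw_handle(shm_handle), DEVICE_ID, BYTE_SIZE
)

# ... run inference with the model's output bound to "output_shm" via
# InferRequestedOutput.set_shared_memory("output_shm", BYTE_SIZE) ...

# Reading the result back is where nvidia-smi shows the unexpected
# allocation on GPU 0 instead of (or in addition to) GPU 1.
output = cudashm.get_contents_as_numpy(shm_handle, np.float32, [1, 1000])
```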
Expected behavior
I would expect the memory to be allocated on the device specified in the calls to cudashm.create_shared_memory_region() and triton_client.register_cuda_shared_memory().
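To check which device actually received the allocation, a small per-GPU memory readout can stand in for watching nvidia-smi. This sketch assumes the nvidia-ml-py (pynvml) bindings are installed; they expose the same counters nvidia-smi prints.

```python
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    # mem.used is in bytes; shift to MiB for readability.
    print(f"GPU {i}: {mem.used >> 20} MiB used")
pynvml.nvmlShutdown()
```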
Top GitHub Comments
The increase appears to be independent of the protocol. I am working on a fix for this unexpected increase, which will likely resolve it for all scenarios.
The 300 MB or so increase in GPU memory is caused by CUDA lazy initialization, which happens whenever a CUDA context is created. This can be limited to some extent using cudaDeviceSetLimit(). Before the call to the function that creates the CUDA IPC handle there is no other CUDA call, so the context gets initialized only when that call is made. As a correction to the earlier statement about the C++ native client: this context creation is needed both for the Python + shared library and the C++ native implementation.
Since this is not actionable, I am closing the ticket. You can look into alternate workarounds if this is a limiting factor. PS: The CUDA context is created by the first runtime call for the target device; internally, you will see a driver call to cuDevicePrimaryCtxRetain. That call allocates the roughly 300 MB, and the allocation occurs on every targeted device as soon as the respective device is first used.
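A minimal sketch of the lazy-initialization behavior described in these comments, driving the CUDA runtime directly through ctypes. The library name assumes Linux, and the device ID and limit value are arbitrary; the cudaLimitMallocHeapSize enum value 0x02 comes from driver_types.h.

```python
import ctypes

# Load the CUDA runtime (Linux library name; adjust per platform).
cudart = ctypes.CDLL("libcudart.so")

# Selecting a device is cheap and does not create a context yet;
# nvidia-smi shows no allocation by this process at this point.
cudart.cudaSetDevice(1)

# The first substantive runtime call creates the device's primary context
# (driver-level cuDevicePrimaryCtxRetain); the ~300 MB lazy-init
# allocation appears in nvidia-smi here.
cudart.cudaFree(None)

# cudaDeviceSetLimit can trim context-owned buffers such as the device
# malloc heap (0x02 == cudaLimitMallocHeapSize), but it cannot eliminate
# the base context footprint.
cudart.cudaDeviceSetLimit(ctypes.c_int(0x02), ctypes.c_size_t(1 << 20))
```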