
cudashm.get_contents_as_numpy always using gpu 0

See original GitHub issue

Description

While running my code I checked nvidia-smi and noticed far more memory allocated on GPU 0 than expected. I have narrowed it down to my return statement, which uses cudashm.get_contents_as_numpy(). Even when I pass a device ID other than 0 to cudashm.create_shared_memory_region() and triton_client.register_cuda_shared_memory(), the memory still ends up on GPU 0.
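
The setup being described presumably looks something like the following sketch, written against the current tritonclient package layout (the 20.06-era client shipped these modules under slightly different names). The model name, tensor names, shapes, and dtype are hypothetical placeholders, not taken from the issue:

```python
# Sketch: allocate and register a CUDA shared-memory output region on GPU 1,
# then read the inference result back with get_contents_as_numpy().
import numpy as np
import tritonclient.grpc as grpcclient
import tritonclient.utils.cuda_shared_memory as cudashm

DEVICE_ID = 1                 # intentionally not GPU 0
OUTPUT_BYTE_SIZE = 4 * 1000   # 1000 FP32 values (placeholder)

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Allocate the CUDA shared-memory region on GPU 1 ...
shm_handle = cudashm.create_shared_memory_region("output_data", OUTPUT_BYTE_SIZE, DEVICE_ID)
# ... and register it with the server for the same device id.
client.register_cuda_shared_memory(
    "output_data", cudashm.get_raw_handle(shm_handle), DEVICE_ID, OUTPUT_BYTE_SIZE
)

# Ask the server to write the output into the registered region.
output = grpcclient.InferRequestedOutput("OUTPUT0")
output.set_shared_memory("output_data", OUTPUT_BYTE_SIZE)

inputs = [grpcclient.InferInput("INPUT0", [1, 3, 224, 224], "FP32")]
inputs[0].set_data_from_numpy(np.random.rand(1, 3, 224, 224).astype(np.float32))

client.infer(model_name="mymodel", inputs=inputs, outputs=[output])

# This is the call the issue is about: it should only touch the region on GPU 1,
# yet memory also shows up on GPU 0 in nvidia-smi.
result = cudashm.get_contents_as_numpy(shm_handle, np.float32, [1, 1000])

client.unregister_cuda_shared_memory("output_data")
cudashm.destroy_shared_memory_region(shm_handle)
```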

Triton Information

I am using the 20.06 Triton container (v2).

To Reproduce

Use a system with more than one GPU. Run a Triton client with gRPC and CUDA shared memory, and set the output region to be allocated on a GPU other than id 0. Watch nvidia-smi and observe that the memory for the output is allocated on GPU 0.

Expected behavior

I would expect the memory to be allocated on the card specified by cudashm.create_shared_memory_region() and triton_client.register_cuda_shared_memory().

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 10 (6 by maintainers)

Top GitHub Comments

1 reaction
CoderHam commented, Sep 10, 2020

The increase appears to be independent of the protocol. I am working on fixing this unexpected increase, and the fix will likely cover all scenarios.

0 reactions
CoderHam commented, Sep 11, 2020

The 300 MB or so increase in GPU memory is caused by CUDA's lazy initialization; it happens whenever a CUDA context is created. This can be limited to some extent using cudaDeviceSetLimit(). Before the call to the function that creates the CUDA IPC handle there is no other CUDA call, so the context gets initialized only when that call is made. As a correction to the earlier statement about the C++ native client: this is needed for both the Python + shared-library and the C++ native implementations.
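
For illustration, here is one way to apply the cudaDeviceSetLimit() suggestion from Python, using the separate cuda-python runtime bindings (an assumption on my part; the comment above refers to the C runtime call, and the limit value is illustrative only):

```python
# Sketch: query and shrink a per-context limit on the target device.
from cuda import cudart


def check(err):
    # cuda-python returns the cudaError_t as the first element of each result tuple.
    if err != cudart.cudaError_t.cudaSuccess:
        raise RuntimeError(f"CUDA runtime error: {err}")


DEVICE_ID = 1

err, = cudart.cudaSetDevice(DEVICE_ID)
check(err)

# Read the current device-malloc heap limit, then reduce it. Per the comment
# above, this can bound the lazy-initialization footprint "to some extent";
# it does not remove the primary-context cost entirely.
err, heap_size = cudart.cudaDeviceGetLimit(cudart.cudaLimit.cudaLimitMallocHeapSize)
check(err)
print(f"current malloc heap limit on GPU {DEVICE_ID}: {heap_size} bytes")

err, = cudart.cudaDeviceSetLimit(cudart.cudaLimit.cudaLimitMallocHeapSize, 1 << 20)
check(err)
```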

Since this is not actionable, I am closing the ticket. You can look into alternate workarounds if this is a limiting factor. PS: the CUDA context is created by the first runtime call for the target device – internally, you’ll see a driver call to cuDevicePrimaryCtxRetain. That call allocates the roughly 300 MB. This allocation will hit every targeted device as soon as the respective device is first used.
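
To see this per-device cost directly, one option (not from the original thread) is to poll per-GPU memory with the nvidia-ml-py bindings around the first CUDA call for a device; the region name and size below are placeholders:

```python
# Sketch: measure the memory jump caused by primary-context creation when the
# first CUDA call targeting a device is made from this process.
import pynvml
import tritonclient.utils.cuda_shared_memory as cudashm

DEVICE_ID = 1
BYTE_SIZE = 4 * 1000  # placeholder region size

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(DEVICE_ID)
before = pynvml.nvmlDeviceGetMemoryInfo(handle).used

# First CUDA call for DEVICE_ID in this process: creating the IPC handle also
# creates the primary context (the ~300 MB allocation described above).
shm_handle = cudashm.create_shared_memory_region("probe_region", BYTE_SIZE, DEVICE_ID)

after = pynvml.nvmlDeviceGetMemoryInfo(handle).used
print(f"GPU {DEVICE_ID} used memory grew by {(after - before) / 2**20:.1f} MiB")

cudashm.destroy_shared_memory_region(shm_handle)
pynvml.nvmlShutdown()
```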

Read more comments on GitHub >
