cudashm.get_contents_as_numpy always using GPU 0
Description
While running my code I checked nvidia-smi and saw unexpected allocations on GPU 0. I have narrowed it down to the return statement that calls cudashm.get_contents_as_numpy(). Even when I pass a non-zero device ID to cudashm.create_shared_memory_region() and triton_client.register_cuda_shared_memory(), the memory still ends up on GPU 0.
Triton Information
I am using the 20.06 Triton container (v2).
To Reproduce
Use a system with more than one GPU. Run a Triton client with gRPC and CUDA shared memory, setting the output to be allocated on a GPU other than ID 0. Watch nvidia-smi and see the output memory get allocated on GPU 0 anyway. A minimal client sketch is shown below.
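A minimal repro sketch, assuming a Triton server reachable at localhost:8001 and the current tritonclient package layout (the 20.06 wheels shipped the same calls under slightly different module names). The region name, byte size, and output shape are placeholders.

```python
import numpy as np
import tritonclient.grpc as grpcclient
import tritonclient.utils.cuda_shared_memory as cudashm

DEVICE_ID = 1                  # any GPU other than 0
BYTE_SIZE = 4 * 1024 * 1024    # placeholder size for the output region

client = grpcclient.InferenceServerClient("localhost:8001")
client.unregister_cuda_shared_memory()

# Create the region on GPU 1 and tell the server about it.
shm_handle = cudashm.create_shared_memory_region("output_shm", BYTE_SIZE, DEVICE_ID)
client.register_cuda_shared_memory(
    "output_shm", cudashm.get_raw_handle(shm_handle), DEVICE_ID, BYTE_SIZE
)

# ... run inference with the model's output bound to "output_shm" via
# InferRequestedOutput.set_shared_memory("output_shm", BYTE_SIZE) ...

# Reading the result back is where nvidia-smi shows the unexpected
# allocation on GPU 0 instead of (or in addition to) GPU 1.
output = cudashm.get_contents_as_numpy(shm_handle, np.float32, [1, 1000])
```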
Expected behavior
I would expect the memory to be allocated on the device specified in the calls to cudashm.create_shared_memory_region() and triton_client.register_cuda_shared_memory().
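To check which device actually received the allocation, a small per-GPU memory readout can stand in for watching nvidia-smi. This sketch assumes the nvidia-ml-py (pynvml) bindings are installed; they expose the same counters nvidia-smi prints.

```python
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    # mem.used is in bytes; shift to MiB for readability.
    print(f"GPU {i}: {mem.used >> 20} MiB used")
pynvml.nvmlShutdown()
```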
Top GitHub Comments
The increase appears to be independent of the protocol. I am working on a fix for this unexpected increase, which will likely resolve it for all scenarios.
The 300 MB or so increase in GPU memory is caused by CUDA lazy initialization, which happens whenever a CUDA context is created. This can be limited to some extent using cudaDeviceSetLimit(). Before the call to the function that creates the CUDA IPC handle there is no other CUDA call, so the context gets initialized only when that call is made. As a correction to the earlier statement about the C++ native client: this context creation is needed both for the Python + shared library and the C++ native implementation.
Since this is not actionable, I am closing the ticket. You can look into alternate workarounds if this is a limiting factor. PS: The CUDA context is created by the first runtime call for the target device; internally, you will see a driver call to cuDevicePrimaryCtxRetain. That call allocates the roughly 300 MB, and the allocation occurs on every targeted device as soon as the respective device is first used.
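A minimal sketch of the lazy-initialization behavior described in these comments, driving the CUDA runtime directly through ctypes. The library name assumes Linux, and the device ID and limit value are arbitrary; the cudaLimitMallocHeapSize enum value 0x02 comes from driver_types.h.

```python
import ctypes

# Load the CUDA runtime (Linux library name; adjust per platform).
cudart = ctypes.CDLL("libcudart.so")

# Selecting a device is cheap and does not create a context yet;
# nvidia-smi shows no allocation by this process at this point.
cudart.cudaSetDevice(1)

# The first substantive runtime call creates the device's primary context
# (driver-level cuDevicePrimaryCtxRetain); the ~300 MB lazy-init
# allocation appears in nvidia-smi here.
cudart.cudaFree(None)

# cudaDeviceSetLimit can trim context-owned buffers such as the device
# malloc heap (0x02 == cudaLimitMallocHeapSize), but it cannot eliminate
# the base context footprint.
cudart.cudaDeviceSetLimit(ctypes.c_int(0x02), ctypes.c_size_t(1 << 20))
```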