Fast transfer of small tensors
Hello again,
After reviewing your example benchmark script, I did some measurements of CPU->GPU->CPU transfer times, comparing a PyTorch pinned CPU tensor against SpeedTorch's gadgetGPU. My tensors are very small compared to your test cases, at most 20x3, but I need to transfer them quickly enough to leave time for other computations at over 100 Hz.
So far, if I have understood how to use SpeedTorch correctly, it seems that PyTorch's pinned CPU tensors have faster transfer times than SpeedTorch's DataGadget CPU-pinned object. See the graph below, where both histograms correspond to the CPU->GPU->CPU transfer of a 6x3 matrix. The pink histogram corresponds to the PyTorch pinned CPU tensor and the turquoise one to SpeedTorch's DataGadget operations.

[Histogram of CPU->GPU->CPU transfer times for a 6x3 matrix; units: milliseconds]
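The linked benchmark code is not reproduced here, but a minimal sketch of how the PyTorch side of such a round trip could be timed might look like this (the 6x3 shape and iteration count are illustrative, not taken from the original script):

```python
import time
import torch

x_cpu = torch.randn(6, 3, pin_memory=True)   # pinned staging buffer on the CPU
x_gpu = torch.empty(6, 3, device='cuda')     # preallocated GPU work buffer

times_ms = []
for _ in range(1000):
    t0 = time.perf_counter()
    x_gpu.copy_(x_cpu, non_blocking=True)    # H2D, asynchronous with pinned memory
    x_cpu.copy_(x_gpu)                       # D2H back into the pinned buffer
    torch.cuda.synchronize()                 # make sure both copies have finished
    times_ms.append((time.perf_counter() - t0) * 1e3)

times_ms.sort()
print(f'median round trip: {times_ms[len(times_ms) // 2]:.3f} ms')
```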
For my use case, it seems PyTorch's pinned CPU tensor has better performance. I would like to ask whether this matches your experience, and what recommendations you could give for achieving better performance with SpeedTorch. My use case involves receiving data on the CPU, transferring it to the GPU, performing various linear-algebra operations, and finally getting the result back to the CPU, as sketched below. All of this must run at 100 Hz minimum. So far I have only achieved 70 Hz and would like to speed up every operation as much as possible.
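For reference, a minimal sketch of such a loop in plain PyTorch, using preallocated pinned buffers and asynchronous copies; the 6x3 shape and the transpose-product are placeholders for the actual data and math, and `new_data` is assumed to be an incoming CPU tensor:

```python
import torch

device = torch.device('cuda')

# Allocate once and reuse every iteration to avoid allocator overhead.
src = torch.empty(6, 3, pin_memory=True)   # pinned staging buffer (CPU)
gpu = torch.empty(6, 3, device=device)     # work buffer (GPU)
out = torch.empty(3, 3, pin_memory=True)   # pinned result buffer (CPU)

def step(new_data):
    src.copy_(new_data)                    # fill the staging buffer from incoming data
    gpu.copy_(src, non_blocking=True)      # async H2D copy over pinned memory
    result = gpu.t() @ gpu                 # stand-in for the real linear algebra
    out.copy_(result, non_blocking=True)   # async D2H copy back to pinned memory
    torch.cuda.synchronize()               # wait before reading `out` on the CPU
    return out
```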
You can find the code I used to generate this graph here; it was run on a Jetson Nano (ARMv8, NVIDIA Tegra X1 GPU).
Thank you very much!
@Santosh-Gupta Thanks for the reply. Yes, that would be 4 CPU cores. I will try to confirm whether

self.data_cpu.CUPYcorpus = cp.asarray(new_data)

would still be using CPU pinned memory, and I will come back with the results. Thanks!

Another approach you might want to consider is using the PyCuda and Numba indexing kernels, with a similar approach of disguising CPU pinned tensors as GPU tensors. I didn't have a chance to try this approach.
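That "disguise pinned CPU memory as a GPU tensor" idea can be prototyped with Numba's mapped arrays, which allocate pinned host memory that CUDA kernels can address directly; on a Jetson Nano this is especially attractive because the CPU and GPU share the same physical memory. A rough sketch (the `scale` kernel and buffer size are illustrative only, not SpeedTorch's actual mechanism):

```python
import numpy as np
from numba import cuda

@cuda.jit
def scale(a, factor):
    # Trivial stand-in for real linear-algebra work.
    i = cuda.grid(1)
    if i < a.size:
        a[i] *= factor

# Pinned host memory mapped into the GPU's address space, so the
# kernel reads and writes it without an explicit H2D/D2H copy.
buf = cuda.mapped_array(18, dtype=np.float32)
buf[:] = np.arange(18, dtype=np.float32)

threads = 32
blocks = (buf.size + threads - 1) // threads
scale[blocks, threads](buf, 2.0)
cuda.synchronize()          # wait for the kernel before reading on the CPU
print(buf[:5])
```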