Fast transfer of small tensors
Hello again,
After reviewing your example benchmark script, I did some measurements of CPU->GPU->CPU transfer times, comparing a PyTorch pinned CPU tensor against SpeedTorch's gadgetGPU. My tensors are very small compared to your test cases, at most 20x3, but I need to transfer them quickly enough to leave time for other computations at over 100 Hz.
So far, if I have understood how to use SpeedTorch correctly, it seems that PyTorch's pinned CPU tensors have faster transfer times than SpeedTorch's DataGadget CPU-pinned object. See the graph below, where both histograms correspond to the CPU->GPU->CPU transfer of a 6x3 matrix. The pink histogram corresponds to the PyTorch pinned CPU tensor and the turquoise one to SpeedTorch's DataGadget operations.

[Histogram of CPU->GPU->CPU transfer times for a 6x3 matrix; units: milliseconds]
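The linked benchmark code is not reproduced here, but a minimal sketch of how the PyTorch side of such a round trip could be timed might look like this (the 6x3 shape and iteration count are illustrative, not taken from the original script):

```python
import time
import torch

x_cpu = torch.randn(6, 3, pin_memory=True)   # pinned staging buffer on the CPU
x_gpu = torch.empty(6, 3, device='cuda')     # preallocated GPU work buffer

times_ms = []
for _ in range(1000):
    t0 = time.perf_counter()
    x_gpu.copy_(x_cpu, non_blocking=True)    # H2D, asynchronous with pinned memory
    x_cpu.copy_(x_gpu)                       # D2H back into the pinned buffer
    torch.cuda.synchronize()                 # make sure both copies have finished
    times_ms.append((time.perf_counter() - t0) * 1e3)

times_ms.sort()
print(f'median round trip: {times_ms[len(times_ms) // 2]:.3f} ms')
```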
For my use case, it seems PyTorch's pinned CPU tensor has better performance. I would like to ask whether this matches your experience, and what recommendations you could give for achieving better performance with SpeedTorch. My use case involves receiving data on the CPU, transferring it to the GPU, performing various linear-algebra operations, and finally getting the result back to the CPU, as sketched below. All of this must run at 100 Hz minimum. So far I have only achieved 70 Hz and would like to speed up every operation as much as possible.
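For reference, a minimal sketch of such a loop in plain PyTorch, using preallocated pinned buffers and asynchronous copies; the 6x3 shape and the transpose-product are placeholders for the actual data and math, and `new_data` is assumed to be an incoming CPU tensor:

```python
import torch

device = torch.device('cuda')

# Allocate once and reuse every iteration to avoid allocator overhead.
src = torch.empty(6, 3, pin_memory=True)   # pinned staging buffer (CPU)
gpu = torch.empty(6, 3, device=device)     # work buffer (GPU)
out = torch.empty(3, 3, pin_memory=True)   # pinned result buffer (CPU)

def step(new_data):
    src.copy_(new_data)                    # fill the staging buffer from incoming data
    gpu.copy_(src, non_blocking=True)      # async H2D copy over pinned memory
    result = gpu.t() @ gpu                 # stand-in for the real linear algebra
    out.copy_(result, non_blocking=True)   # async D2H copy back to pinned memory
    torch.cuda.synchronize()               # wait before reading `out` on the CPU
    return out
```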
You can find the code I used to generate this graph here; it was run on a Jetson Nano (ARMv8, NVIDIA Tegra X1 GPU).
Thank you very much!
@Santosh-Gupta Thanks for the reply. Yes, that would be 4 CPU cores. I will try to confirm whether

self.data_cpu.CUPYcorpus = cp.asarray(new_data)

would still be using CPU pinned memory, and I will come back with the results. Thanks!

Another approach you might want to consider is using the PyCuda and Numba indexing kernels, with a similar approach of disguising CPU pinned tensors as GPU tensors. I didn't have a chance to try this approach.
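That "disguise pinned CPU memory as a GPU tensor" idea can be prototyped with Numba's mapped arrays, which allocate pinned host memory that CUDA kernels can address directly; on a Jetson Nano this is especially attractive because the CPU and GPU share the same physical memory. A rough sketch (the `scale` kernel and buffer size are illustrative only, not SpeedTorch's actual mechanism):

```python
import numpy as np
from numba import cuda

@cuda.jit
def scale(a, factor):
    # Trivial stand-in for real linear-algebra work.
    i = cuda.grid(1)
    if i < a.size:
        a[i] *= factor

# Pinned host memory mapped into the GPU's address space, so the
# kernel reads and writes it without an explicit H2D/D2H copy.
buf = cuda.mapped_array(18, dtype=np.float32)
buf[:] = np.arange(18, dtype=np.float32)

threads = 32
blocks = (buf.size + threads - 1) // threads
scale[blocks, threads](buf, 2.0)
cuda.synchronize()          # wait for the kernel before reading on the CPU
print(buf[:5])
```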