
Fast transfer of small tensors

See original GitHub issue

Hello again,

After reviewing your example benchmark script, I did some measurements on CPU->GPU->CPU transfer times, comparing a PyTorch pinned CPU tensor against SpeedTorch's gadgetGPU. My tensors are very small compared to your test cases, at most 20x3, but I need to transfer them fast enough to allow me to make other computations at >100 Hz.

So far, if I have understood how to use SpeedTorch correctly, it seems that PyTorch's pinned CPU tensors have faster transfer times than SpeedTorch's pinned-CPU DataGadget object. See the graph below, where both histograms correspond to CPU->GPU->CPU transfers of a 6x3 matrix; the pink histogram corresponds to PyTorch's pinned CPU tensor and the turquoise one to SpeedTorch's DataGadget operations.

[Figure: torchpin_vs_speedtorch — round-trip transfer-time histograms, units: milliseconds]

For my use case, it seems PyTorch's pinned CPU tensor has better performance. Does this match your experience, and what recommendations could you provide for using SpeedTorch to achieve better performance? My use case involves receiving data on the CPU, transferring it to the GPU, performing various linear algebra operations, and finally getting the result back to the CPU. All of this must run at 100 Hz minimum; so far I have only achieved 70 Hz and would like to speed up every operation as much as possible.

You can find the code I used to get this graph here; it was run on a Jetson Nano (ARMv8, NVIDIA Tegra X1 GPU).
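
For context, a minimal sketch of the kind of round-trip timing loop being compared (not the exact linked script; the 6x3 shape, iteration count, and the placeholder GPU operation are illustrative assumptions) could look like this with ordinary PyTorch pinned tensors:

```python
import time
import torch

device = torch.device("cuda")

# Small pinned (page-locked) CPU tensors, mirroring the 6x3 case above
src_cpu = torch.zeros(6, 3, pin_memory=True)
out_cpu = torch.empty(6, 3, pin_memory=True)

# Warm-up so CUDA context/allocator setup is not measured
gpu = src_cpu.to(device, non_blocking=True)
torch.cuda.synchronize()

times = []
for _ in range(1000):
    t0 = time.perf_counter()
    gpu = src_cpu.to(device, non_blocking=True)   # CPU -> GPU
    result = gpu * 2.0                            # stand-in for the real linear algebra
    out_cpu.copy_(result, non_blocking=True)      # GPU -> pinned CPU
    torch.cuda.synchronize()                      # wait for the copies so timing is honest
    times.append((time.perf_counter() - t0) * 1e3)

times.sort()
print(f"median round trip: {times[len(times) // 2]:.3f} ms")
```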

Thank you very much!

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
juanmed commented, Sep 24, 2019

@Santosh-Gupta Thanks for your reply. Yes, that would be 4 cores in the CPU. I will try to confirm whether self.data_cpu.CUPYcorpus = cp.asarray(new_data) would still be using CPU pinned memory and come back with the results. Thanks!
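
As a side note on that check: cp.asarray(new_data) by itself returns a CuPy array that lives on the GPU, so a pinned host staging buffer has to be allocated explicitly. A minimal sketch of that pattern using CuPy's pinned-memory allocator (the helper name and the shapes are illustrative, not SpeedTorch internals):

```python
import numpy as np
import cupy as cp

def pinned_array_like(arr):
    """Return a NumPy view over page-locked (pinned) host memory with arr's shape/dtype."""
    mem = cp.cuda.alloc_pinned_memory(arr.nbytes)
    pinned = np.frombuffer(mem, arr.dtype, arr.size).reshape(arr.shape)
    pinned[...] = arr
    return pinned

new_data = np.random.rand(6, 3).astype(np.float32)
host_staging = pinned_array_like(new_data)   # stays on the CPU, page-locked
gpu_copy = cp.asarray(host_staging)          # H2D copy; the result itself is a device array
```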

0 reactions
Santosh-Gupta commented, Sep 25, 2019

Another approach you might want to consider is using PyCUDA and Numba indexing kernels with a similar idea: disguising CPU pinned tensors as GPU tensors. I haven't had a chance to try this approach.
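
There is no tested snippet for this in the thread, but a rough sketch of the zero-copy idea with Numba might look like the following: a page-locked host array mapped into the GPU's address space, so a kernel can index it directly without an explicit copy (the toy kernel and sizes are assumptions, not the OP's workload).

```python
import numpy as np
from numba import cuda

@cuda.jit
def scale_inplace(buf, factor):
    # Each thread scales one element of the mapped buffer in place
    i = cuda.grid(1)
    if i < buf.size:
        buf[i] *= factor

# Pinned host memory mapped into the device address space (zero-copy):
# the GPU reads/writes it directly instead of going through a separate cudaMemcpy.
data = cuda.mapped_array(18, dtype=np.float32)   # e.g. a flattened 6x3 matrix
data[:] = np.arange(18, dtype=np.float32)

scale_inplace[1, 32](data, 2.0)
cuda.synchronize()
print(data[:6])
```

On a Jetson, where CPU and GPU share the same physical memory, mapped allocations like this can be particularly attractive because they may avoid the copy entirely.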

Read more comments on GitHub >

Top Results From Across the Web

  • 7 Tips To Maximize PyTorch Performance | by William Falcon — "Construct tensors directly on GPUs. However, this first creates a CPU tensor, and THEN transfers it to GPU… this is really slow. Instead, create…"
  • PyTorch preferred way to copy a tensor - python - Stack Overflow — "TL;DR. Use .clone().detach() (or preferably .detach().clone()). If you first detach the tensor and then clone it, the computation path is…"
  • How to Optimize Data Transfers in CUDA C/C++ — "Batching many small transfers into one larger transfer performs… On my desktop PC with a much faster Intel Core i7-3930K CPU (3.2…"
  • Efficient Training on a Single GPU - Hugging Face — "To see how much it is we load a tiny tensor into the GPU which triggers the… on CPU and typically leads…"
  • Speed Up Model Training - PyTorch Lightning - Read the Docs — "LightningModules know what device they are on! Construct tensors on the device directly to avoid CPU->Device transfer. # bad t = …"
