
Implement GPU Prefetching of data.


🚀 Feature

Copying data from the host (CPU) to the device (GPU) is a time-consuming operation that can cause GPU starvation, where the GPU sits idle waiting for data.

On a PCIe 3.0 connection (the most common for GPUs), a 16 MB chunk of pinned data moves at approximately 6 GB/s, so a single 16 MB batch takes roughly 2.7 ms to transfer. While this is fine for most research workloads, it can be a serious bottleneck for industrial use or for large inputs (such as 3D volumes).

Smaller transfers are even slower in terms of effective bandwidth, because each copy carries a constant overhead.
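
As a sanity check on a particular machine, the transfer rate can be measured directly with CUDA events. The snippet below is a minimal sketch, not part of the original proposal; the 16 MB buffer size simply mirrors the figure quoted above.

import torch

device = torch.device("cuda")
data = torch.empty(16 * 1024 * 1024, dtype=torch.uint8).pin_memory()  # 16 MB, page-locked
data.to(device, non_blocking=True)  # warm-up copy (initializes the CUDA context)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
data.to(device, non_blocking=True)
end.record()
torch.cuda.synchronize()  # wait for the queued copy to finish
elapsed_ms = start.elapsed_time(end)
print(f"16 MB copied in {elapsed_ms:.3f} ms ({16 / 1024 / (elapsed_ms / 1000):.2f} GB/s)")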

Motivation

GPU prefetching is a useful feature that already exists in other libraries (e.g. TensorFlow, NVIDIA DALI).

However, these libraries rely on graph mode to prefetch their data to the GPU.

This is not necessary: a slight adjustment to the Trainer class could allow prefetching data to the GPU without any fancy new library.

This is because PyTorch already provides asynchronous GPU transfer through .to(device, non_blocking=True); note that the copy is only truly asynchronous when the source tensor resides in pinned (page-locked) host memory.

All that needs to be done is to rewrite the Python for loop as a while loop so that the next mini-batch is sent to the GPU(s) while they are busy running the deep learning model.

According to the PyTorch NeurIPS 2019 paper, PyTorch queues GPU commands while the CPU asynchronously continues with the next piece of host code.

This means the host-side command to send the next batch asynchronously to the GPU(s) can be issued while the GPUs are still running, allowing data transfer to overlap with computation.

Pitch

A Python for loop is almost always used to iterate over a DataLoader during training.

However, any for loop can be rewritten as a while loop.

Most importantly, a while loop allows the next mini-batch to be prefetched from CPU to GPU while the current mini-batch is being processed on the GPU.

The pseudocode below illustrates the idea (the parenthesized comments stand in for the model's forward and backward steps):

loader = DataLoader(...)
iterator = iter(loader)
not_finished = True
next_data = next(iterator)
next_data = next_data.to(device, non_blocking=True)  # queue async transfer of the first batch
while not_finished:
    data = next_data
    # (model forward step)
    try:
        next_data = next(iterator)
        next_data = next_data.to(device, non_blocking=True)  # prefetch while the GPU is busy
    except StopIteration:
        not_finished = False
    # (model backward step)

This is a very general idea and not difficult to implement, though I am not sure whether TPUs support asynchronous data transfer.

Alternatives

Integrate NVIDIA DALI for PyTorch into PyTorch Lightning.

Additional context

The code example above prefetches just one mini-batch to the GPU while training is in progress; it does not queue multiple mini-batches.

However, since GPU compute is almost always the bottleneck and CPU bottlenecks can simply be handled by increasing the number of workers, the proposed solution is an adequate one.
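
For reference, both the worker count and the pinned-memory staging that non_blocking transfers rely on are ordinary DataLoader arguments; dataset and the batch size below are placeholders.

from torch.utils.data import DataLoader

# num_workers parallelizes host-side loading and preprocessing; pin_memory stages
# batches in page-locked memory so .to(device, non_blocking=True) is truly asynchronous.
loader = DataLoader(dataset, batch_size=64, num_workers=4, pin_memory=True)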

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 5
  • Comments: 10 (7 by maintainers)

Top GitHub Comments

1 reaction
ethanwharris commented, Apr 4, 2020

This is interesting - currently we already do a kind of prefetching in the loop for our support of iterable datasets; this would just require moving the device handling code and setting the non_blocking flag. Presumably we would only do this if the user wants it (by adding a flag to the trainer like device_prefetch or something), just in case having two batches of data in memory is a problem in some settings. I can take a look at this if needed 😃
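
As a rough illustration of the idea in the comment above (moving the device transfer into the prefetch step), a wrapper along these lines could work. The class name and structure are hypothetical, not Lightning internals, and device_prefetch is only the flag name floated above.

class DevicePrefetcher:
    """Wraps a batch iterator, keeping the next batch's device transfer in flight."""

    def __init__(self, loader, device):
        self.iterator = iter(loader)
        self.device = device
        self.next_batch = None
        self._preload()

    def _preload(self):
        try:
            batch = next(self.iterator)
            # Queue the asynchronous host-to-device copy right away.
            self.next_batch = batch.to(self.device, non_blocking=True)
        except StopIteration:
            self.next_batch = None

    def __iter__(self):
        return self

    def __next__(self):
        if self.next_batch is None:
            raise StopIteration
        batch = self.next_batch
        self._preload()  # start transferring the next batch while the model runs
        return batch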

1 reaction
williamFalcon commented, Apr 4, 2020
  1. I don’t see any issues with converting the for-loop into a while loop if it gets us prefetch.
  2. There was a PR to integrate DALI but the author faded haha. Maybe you want to do the integration (#789)?

@veritas9872
