
Does data_prefetcher() really speed up training?

See original GitHub issue

I used your Python code https://github.com/NVIDIA/apex/blob/master/examples/imagenet/main_amp.py#L256

My code is

I think this data_prefetcher should speed up training, because a second stream sends data to GPU memory while the model is running on the GPU, so there is only a very small gap between two iterations.

However, this trick does not work for me, so please help me with two questions:

  • Does data_prefetcher really speed things up?
  • Why doesn’t it work in my case?
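
For reference, the prefetcher in the linked example works roughly like this (a condensed sketch of that code with the normalization step omitted; see the link for the exact version):

import torch

class data_prefetcher():
    def __init__(self, loader):
        self.loader = iter(loader)
        self.stream = torch.cuda.Stream()   # side stream for the HtoD copies
        self.preload()

    def preload(self):
        try:
            self.next_input, self.next_target = next(self.loader)
        except StopIteration:
            self.next_input = None
            self.next_target = None
            return
        # Launch the copies asynchronously on the side stream. They can only
        # overlap with compute if the CPU tensors are pinned (pin_memory=True).
        with torch.cuda.stream(self.stream):
            self.next_input = self.next_input.cuda(non_blocking=True)
            self.next_target = self.next_target.cuda(non_blocking=True)

    def next(self):
        # Make the default stream wait until the side-stream copies have finished.
        torch.cuda.current_stream().wait_stream(self.stream)
        input, target = self.next_input, self.next_target
        if input is not None:
            input.record_stream(torch.cuda.current_stream())
        if target is not None:
            target.record_stream(torch.cuda.current_stream())
        self.preload()   # immediately start fetching the next batch
        return input, target

It is driven in a loop such as input, target = prefetcher.next(), repeated until next() returns None.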

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Comments: 21 (5 by maintainers)

Top GitHub Comments

42 reactions
mcarilli commented, May 17, 2019

By default, PyTorch enqueues all operations involving the gpu (kernel launches, cpu->gpu memcopies, and gpu->cpu memcopies) on the same stream (the “default stream”). Operations on the same stream are serialized and can never overlap. For two operations to overlap, they must be in different streams. Also, for cpu->gpu and gpu->cpu memcopies in particular, the CPU-side memory must be pinned, otherwise the memcopy will be blocking with respect to all streams.
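
For example, the pinned-memory rule can be seen with a toy copy (a sketch; sizes and names are arbitrary):

import torch

x_pageable = torch.randn(1024, 1024)               # ordinary pageable CPU memory
x_pinned = torch.randn(1024, 1024).pin_memory()    # page-locked CPU memory

side = torch.cuda.Stream()
with torch.cuda.stream(side):
    a = x_pageable.cuda(non_blocking=True)  # non_blocking is ignored for pageable memory; the copy blocks
    b = x_pinned.cuda(non_blocking=True)    # truly asynchronous HtoD copy on the side stream

Only the pinned copy can actually overlap with work on other streams.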

The forward pass is performed in the default stream. Therefore, for a cpu->gpu prefetch (of the next iteration’s data) to overlap with the forward pass of the current iteration:

  1. the data batch on the cpu must be pinned, and
  2. the prefetch must be carried out in a side stream.

Our data_prefetcher satisfies both of these requirements.

For overlapped prefetching, supplying pin_memory=True to the DataLoader is always required (to satisfy 1.). If your data batch is a tuple of Tensors, then supplying pin_memory=True plus using the prefetcher should be enough to enable overlap. If any element of your data batch tuple wraps its Tensors in a custom class, you must also give that custom class a pin_memory method, which the DataLoader will call to ensure the batch’s cpu-side memory is pinned (again to satisfy 1.), as I said in my previous post.
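
For the custom-class case, a minimal sketch following the memory-pinning pattern in the PyTorch data-loading docs (the class and field names here are illustrative):

import torch
from torch.utils.data import DataLoader, TensorDataset

class SimpleCustomBatch:
    def __init__(self, data):
        transposed = list(zip(*data))
        self.inp = torch.stack(transposed[0], 0)
        self.tgt = torch.stack(transposed[1], 0)

    # Called by the DataLoader when pin_memory=True, so the cpu-side batch is pinned.
    def pin_memory(self):
        self.inp = self.inp.pin_memory()
        self.tgt = self.tgt.pin_memory()
        return self

def collate_wrapper(batch):
    return SimpleCustomBatch(batch)

dataset = TensorDataset(torch.randn(100, 3, 32, 32), torch.randint(0, 10, (100,)))
loader = DataLoader(dataset, batch_size=8, collate_fn=collate_wrapper, pin_memory=True)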

I’m not sure why the dataloading time doubles for a 2-node run. Are your dataset’s files stored on only one node’s hard drive and accessed from the other node over a shared network drive or something similar? That would make dataloading on one node slower than on the other. For best results, the full dataset’s files should be present on the local disk of each node.

3 reactions
DelightRun commented, Aug 20, 2020

After doing some deep profiling, I think the problem is due to the record_stream(torch.cuda.current_stream()) call, which blocks the default stream until the copy completes. Here is what PyTorch’s documentation says about record_stream (link):

Ensures that the tensor memory is not reused for another tensor until all current work queued on stream are complete.

Notice the phrase “until all current work queued on stream are complete.” It means the tensor’s memory will not be reused while the recorded stream is still working. But for some reason, the call also blocks the default stream until the copy stream has finished.

Here is the timeline (profiler screenshot): as you can see, the default stream is blocked until the HtoD copy is done.

After removing the record_stream call (second profiler screenshot), the default stream is no longer blocked.
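
For context, this is the next() method of the prefetcher sketched earlier with that call removed, i.e. the variant profiled above; the trade-off is noted in the comments (a sketch of the experiment, not a recommended fix):

    def next(self):
        # The default stream still waits for the side-stream HtoD copies.
        torch.cuda.current_stream().wait_stream(self.stream)
        input, target = self.next_input, self.next_target
        # record_stream(torch.cuda.current_stream()) removed: the stall seen in
        # the first timeline disappears, but the caching allocator now only
        # associates these tensors with the side stream, so their memory could
        # be reused while the default stream is still reading them unless
        # references are held until that work is done.
        self.preload()
        return input, target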

