Does data_prefetcher() really speed up training?
See the original GitHub issue.
I used your Python code: https://github.com/NVIDIA/apex/blob/master/examples/imagenet/main_amp.py#L256
My setup is:
- my code: https://github.com/zhangpzh/maskrcnn-benchmark/blob/Falcon/tools/train_net.py
- I replaced the for-each loop over the dataloader with your while True prefetcher loop (a sketch of this loop appears right after this question)
- PyTorch with torch.nn.parallel.DistributedDataParallel
- 8 GPUs
I expected data_prefetcher to speed up training, because a separate stream copies data to GPU memory while the model is running on the GPU, so there should be only a very small gap between two iterations. However, this trick does not work for me, so please help me:
- Does data_prefetcher really speed up training?
- Why doesn't it work in my case?
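For context, here is a condensed sketch of the prefetcher usage pattern the question refers to. It is paraphrased from apex's main_amp.py rather than copied verbatim, it assumes the data_prefetcher class defined in that file, and model, criterion, and optimizer are placeholders.

```python
# Sketch of the "while True" prefetcher loop that replaces the usual
# "for batch in dataloader" loop (paraphrased from apex's main_amp.py).
def train_one_epoch(train_loader, model, criterion, optimizer):
    prefetcher = data_prefetcher(train_loader)  # copies batches on a side stream
    input, target = prefetcher.next()
    while input is not None:
        output = model(input)
        loss = criterion(output, target)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # The next batch was already being copied to the GPU while the
        # forward/backward pass above was running.
        input, target = prefetcher.next()
```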
Issue Analytics
- Created: 4 years ago
- Comments: 21 (5 by maintainers)
By default, Pytorch enqueues all operations involving the gpu (kernel launches, cpu->gpu memcopies, and gpu->cpu memcopies) on the same stream (the “default stream”). Operations on the same stream are serialized and can never overlap. For two operations to overlap, they must be in different streams. Also, for cpu->gpu and gpu->cpu memcopies in particular, the CPU-side memory must be pinned, otherwise the memcopy will be blocking with respect to all streams.
The forward pass is performed in the default stream. Therefore, for a cpu->gpu prefetch (of the next iteration's data) to overlap with the forward pass of the current iteration, the prefetch memcopy must be issued on a different (side) stream, and the cpu-side memory it copies from must be pinned.
Our data_prefetcher satisfies both of these requirements.
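To make those two requirements concrete, here is a minimal prefetcher sketch. It is a simplified paraphrase of the apex data_prefetcher, not the actual implementation, and the class name SimplePrefetcher is made up here: the HtoD copy is issued on a dedicated side stream from pinned memory, and next() makes the default stream wait on that side stream before the tensors are used.

```python
import torch

class SimplePrefetcher:
    """Minimal prefetcher sketch: overlaps HtoD copies with compute.

    Assumes the dataloader was built with pin_memory=True, so the
    non_blocking copies below are truly asynchronous.
    """
    def __init__(self, loader):
        self.loader = iter(loader)
        self.stream = torch.cuda.Stream()  # side stream for HtoD copies
        self._preload()

    def _preload(self):
        try:
            self.next_input, self.next_target = next(self.loader)
        except StopIteration:
            self.next_input = self.next_target = None
            return
        with torch.cuda.stream(self.stream):
            # Asynchronous copies: they only overlap with compute if the
            # cpu-side source memory is pinned.
            self.next_input = self.next_input.cuda(non_blocking=True)
            self.next_target = self.next_target.cuda(non_blocking=True)

    def next(self):
        # Make the default stream wait until the side-stream copy is done.
        torch.cuda.current_stream().wait_stream(self.stream)
        input, target = self.next_input, self.next_target
        if input is not None:
            # Tell the caching allocator these tensors are now used by the
            # default stream (this is the record_stream call discussed below).
            input.record_stream(torch.cuda.current_stream())
            target.record_stream(torch.cuda.current_stream())
        self._preload()
        return input, target
```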
For overlapped prefetching, supplying pin_memory=True to the dataloader is always required (it is what pins the cpu-side memory). If your data batch is a tuple of Tensors, then supplying pin_memory=True and using the prefetcher should be enough to enable overlap. If any element of your data batch tuple wraps its Tensors in a custom class, you must also supply a pin_memory method on that custom class, which the dataloader will call to ensure the batch's cpu-side memory is pinned, as I said in my previous post. A sketch of such a method is shown below.

I'm not sure why the dataloading time doubles for a 2-node run. Are the dataset's files only on one node's hard drive, and being accessed from the other node via a shared network drive or something? That would make one node slower than the other. For best results, the full dataset's files should be present on the hard drive of both nodes.
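For the custom-batch case, a hedged sketch of what such a pin_memory method can look like. The class, field, and collate-function names here are invented for illustration; what is real is that PyTorch's DataLoader, when pin_memory=True, calls pin_memory() on batch objects that define it.

```python
import torch

class CustomBatch:
    """Hypothetical batch wrapper, e.g. images plus target tensors."""
    def __init__(self, images, targets):
        self.images = images    # CPU tensor
        self.targets = targets  # CPU tensor

    def pin_memory(self):
        # Called by the DataLoader's pin-memory logic when pin_memory=True,
        # so later non_blocking HtoD copies can overlap with compute.
        self.images = self.images.pin_memory()
        self.targets = self.targets.pin_memory()
        return self

def collate_wrapper(samples):
    # samples: list of (image, target) tensor pairs (hypothetical dataset format)
    images = torch.stack([s[0] for s in samples])
    targets = torch.stack([s[1] for s in samples])
    return CustomBatch(images, targets)

# loader = torch.utils.data.DataLoader(dataset, batch_size=8,
#                                      collate_fn=collate_wrapper,
#                                      pin_memory=True)
```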
After doing some deeper profiling, I think the problem is due to the
record_stream(torch.cuda.current_stream())
call, which blocks the default stream until the copy completes. Here is what PyTorch's documentation says about record_stream (link): notice the phrase "until all current work queued on stream are complete." It means the tensor's memory will not be reused while the recorded stream is still working. But for some reason, it blocks the default stream until the tensor's copy stream has completed.
Here is the timeline:
As you can see, the default stream is blocked until the HtoD copy is done.
After removing record_stream:
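In terms of the prefetcher sketch shown earlier, the change being described presumably amounts to dropping the record_stream calls in next(); this is an assumption about the commenter's edit, not a quote of it. Keep in mind what record_stream is for: without it, the caching allocator only knows the copy buffers belong to the side stream, so it may recycle them for a later prefetch while queued default-stream work is still reading them, unless other synchronization prevents that.

```python
    def next(self):
        torch.cuda.current_stream().wait_stream(self.stream)
        input, target = self.next_input, self.next_target
        # record_stream calls removed, per the profiling above. This trades
        # away the allocator's guarantee that these buffers are not reused
        # before the default-stream work reading them has finished.
        # input.record_stream(torch.cuda.current_stream())
        # target.record_stream(torch.cuda.current_stream())
        self._preload()
        return input, target
```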