Does data_prefetcher() really speed up training?
See the original GitHub issue.
I used your Python code: https://github.com/NVIDIA/apex/blob/master/examples/imagenet/main_amp.py#L256
My setup is:
- my code: https://github.com/zhangpzh/maskrcnn-benchmark/blob/Falcon/tools/train_net.py
- I replaced the for-each loop over the dataloader with your while True prefetcher loop (a sketch of this loop appears right after this question)
- PyTorch with torch.nn.parallel.DistributedDataParallel
- 8 GPUs
I expected data_prefetcher to speed up training, because a separate stream copies data to GPU memory while the model is running on the GPU, so there should be only a very small gap between two iterations. However, this trick does not work for me, so please help me:
- Does data_prefetcher really speed up training?
- Why doesn't it work in my case?
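For context, here is a condensed sketch of the prefetcher usage pattern the question refers to. It is paraphrased from apex's main_amp.py rather than copied verbatim, it assumes the data_prefetcher class defined in that file, and model, criterion, and optimizer are placeholders.

```python
# Sketch of the "while True" prefetcher loop that replaces the usual
# "for batch in dataloader" loop (paraphrased from apex's main_amp.py).
def train_one_epoch(train_loader, model, criterion, optimizer):
    prefetcher = data_prefetcher(train_loader)  # copies batches on a side stream
    input, target = prefetcher.next()
    while input is not None:
        output = model(input)
        loss = criterion(output, target)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # The next batch was already being copied to the GPU while the
        # forward/backward pass above was running.
        input, target = prefetcher.next()
```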
Issue Analytics
- Created: 4 years ago
- Comments: 21 (5 by maintainers)
By default, Pytorch enqueues all operations involving the gpu (kernel launches, cpu->gpu memcopies, and gpu->cpu memcopies) on the same stream (the “default stream”). Operations on the same stream are serialized and can never overlap. For two operations to overlap, they must be in different streams. Also, for cpu->gpu and gpu->cpu memcopies in particular, the CPU-side memory must be pinned, otherwise the memcopy will be blocking with respect to all streams.
The forward pass is performed in the default stream. Therefore, for a cpu->gpu prefetch (of the next iteration's data) to overlap with the forward pass of the current iteration, the prefetch memcopy must be issued on a different (side) stream, and the cpu-side memory it copies from must be pinned.
Our data_prefetcher satisfies both of these requirements.
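To make those two requirements concrete, here is a minimal prefetcher sketch. It is a simplified paraphrase of the apex data_prefetcher, not the actual implementation, and the class name SimplePrefetcher is made up here: the HtoD copy is issued on a dedicated side stream from pinned memory, and next() makes the default stream wait on that side stream before the tensors are used.

```python
import torch

class SimplePrefetcher:
    """Minimal prefetcher sketch: overlaps HtoD copies with compute.

    Assumes the dataloader was built with pin_memory=True, so the
    non_blocking copies below are truly asynchronous.
    """
    def __init__(self, loader):
        self.loader = iter(loader)
        self.stream = torch.cuda.Stream()  # side stream for HtoD copies
        self._preload()

    def _preload(self):
        try:
            self.next_input, self.next_target = next(self.loader)
        except StopIteration:
            self.next_input = self.next_target = None
            return
        with torch.cuda.stream(self.stream):
            # Asynchronous copies: they only overlap with compute if the
            # cpu-side source memory is pinned.
            self.next_input = self.next_input.cuda(non_blocking=True)
            self.next_target = self.next_target.cuda(non_blocking=True)

    def next(self):
        # Make the default stream wait until the side-stream copy is done.
        torch.cuda.current_stream().wait_stream(self.stream)
        input, target = self.next_input, self.next_target
        if input is not None:
            # Tell the caching allocator these tensors are now used by the
            # default stream (this is the record_stream call discussed below).
            input.record_stream(torch.cuda.current_stream())
            target.record_stream(torch.cuda.current_stream())
        self._preload()
        return input, target
```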
For overlapped prefetching, supplying pin_memory=True to the dataloader is always required (it is what pins the cpu-side memory). If your data batch is a tuple of Tensors, then supplying pin_memory=True and using the prefetcher should be enough to enable overlap. If any element of your data batch tuple wraps its Tensors in a custom class, you must also supply a pin_memory method on that custom class, which the dataloader will call to ensure the batch's cpu-side memory is pinned, as I said in my previous post. A sketch of such a method is shown below.

I'm not sure why the dataloading time doubles for a 2-node run. Are the dataset's files only on one node's hard drive, and being accessed from the other node via a shared network drive or something? That would make one node slower than the other. For best results, the full dataset's files should be present on the hard drive of both nodes.
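For the custom-batch case, a hedged sketch of what such a pin_memory method can look like. The class, field, and collate-function names here are invented for illustration; what is real is that PyTorch's DataLoader, when pin_memory=True, calls pin_memory() on batch objects that define it.

```python
import torch

class CustomBatch:
    """Hypothetical batch wrapper, e.g. images plus target tensors."""
    def __init__(self, images, targets):
        self.images = images    # CPU tensor
        self.targets = targets  # CPU tensor

    def pin_memory(self):
        # Called by the DataLoader's pin-memory logic when pin_memory=True,
        # so later non_blocking HtoD copies can overlap with compute.
        self.images = self.images.pin_memory()
        self.targets = self.targets.pin_memory()
        return self

def collate_wrapper(samples):
    # samples: list of (image, target) tensor pairs (hypothetical dataset format)
    images = torch.stack([s[0] for s in samples])
    targets = torch.stack([s[1] for s in samples])
    return CustomBatch(images, targets)

# loader = torch.utils.data.DataLoader(dataset, batch_size=8,
#                                      collate_fn=collate_wrapper,
#                                      pin_memory=True)
```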
After doing some deeper profiling, I think the problem is due to the
record_stream(torch.cuda.current_stream())
call, which blocks the default stream until the copy completes. Here is what PyTorch's documentation says about record_stream (link): notice the phrase "until all current work queued on stream are complete." It means the tensor's memory will not be reused while the recorded stream is still working. But for some reason, it blocks the default stream until the tensor's copy stream has completed.
Here is the timeline:
As you can see, the default stream is blocked until the HtoD copy is done.
After removing record_stream:
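In terms of the prefetcher sketch shown earlier, the change being described presumably amounts to dropping the record_stream calls in next(); this is an assumption about the commenter's edit, not a quote of it. Keep in mind what record_stream is for: without it, the caching allocator only knows the copy buffers belong to the side stream, so it may recycle them for a later prefetch while queued default-stream work is still reading them, unless other synchronization prevents that.

```python
    def next(self):
        torch.cuda.current_stream().wait_stream(self.stream)
        input, target = self.next_input, self.next_target
        # record_stream calls removed, per the profiling above. This trades
        # away the allocator's guarantee that these buffers are not reused
        # before the default-stream work reading them has finished.
        # input.record_stream(torch.cuda.current_stream())
        # target.record_stream(torch.cuda.current_stream())
        self._preload()
        return input, target
```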