
Failure when using `scatter` for H2D copy


🐛 Bug

To Reproduce

Apply the following diff to today’s master branch of maskrcnn-benchmark.

diff --git i/maskrcnn_benchmark/data/collate_batch.py w/maskrcnn_benchmark/data/collate_batch.py
index a7f0341..c906712 100644
--- i/maskrcnn_benchmark/data/collate_batch.py
+++ w/maskrcnn_benchmark/data/collate_batch.py
@@ -14,7 +14,7 @@ class BatchCollator(object):
 
     def __call__(self, batch):
         transposed_batch = list(zip(*batch))
-        images = to_image_list(transposed_batch[0], self.size_divisible)
+        images = transposed_batch[0]
         targets = transposed_batch[1]
         img_ids = transposed_batch[2]
         return images, targets, img_ids
diff --git i/maskrcnn_benchmark/engine/trainer.py w/maskrcnn_benchmark/engine/trainer.py
index 38a9e52..be0c9f1 100644
--- i/maskrcnn_benchmark/engine/trainer.py
+++ w/maskrcnn_benchmark/engine/trainer.py
@@ -60,12 +60,26 @@ def do_train(
 
         scheduler.step()
 
-        images = images.to(device)
+        from maskrcnn_benchmark.structures.image_list import to_image_list
+        from torch.nn.parallel.scatter_gather import scatter
+
+        orig_images = images
+
+        images = [scatter(im, [dist.get_rank()])[0] for im in images]  # this fails
+        #images = [im.cuda() for im in images]  # this works
+
+        for orig_im, im in zip(orig_images, images):
+            diff = (orig_im - im.cpu()).abs().max()
+            assert diff < 0.01, diff
+        images = to_image_list(images, 32)
+
         targets = [target.to(device) for target in targets]
 
         loss_dict = model(images, targets)
 
         losses = sum(loss for loss in loss_dict.values())
+        if torch.isnan(losses).any():
+            raise FloatingPointError()
 
         # reduce losses over all GPUs for logging purposes
         loss_dict_reduced = reduce_loss_dict(loss_dict)

Before the diff, the code (1) pads the images into an ImageList in the dataloader, then (2) copies the ImageList to the GPU. After the diff, the code (1) does not pad images in the dataloader, (2) copies the images to the GPU individually, and (3) pads the GPU images into an ImageList.
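For clarity, here is a rough sketch of the two pipelines side by side (not taken verbatim from the repo; it assumes to_image_list(tensors, size_divisible) from maskrcnn-benchmark pads a list of CHW tensors into one batched ImageList):

import torch
from maskrcnn_benchmark.structures.image_list import to_image_list

device = torch.device("cuda", 0)
cpu_images = [torch.randn(3, 800, 1216), torch.randn(3, 600, 901)]  # made-up sizes

# Before the diff: pad on the CPU inside the dataloader, then one H2D copy.
padded = to_image_list(cpu_images, 32).to(device)

# After the diff: copy each raw image to the GPU first, then pad on the GPU.
gpu_images = [im.to(device) for im in cpu_images]
padded_on_gpu = to_image_list(gpu_images, 32)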

The two should behave identically, and indeed everything works when I use .cuda() for the H2D copy. However, it fails when I use scatter()[0] for the copy (which is what DistributedDataParallel uses internally).
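In isolation the two copy paths look like this (a minimal sketch, not from the diff; device_id stands in for the local rank):

import torch
from torch.nn.parallel.scatter_gather import scatter

device_id = 0
cpu_img = torch.randn(3, 800, 1216)

# The path that works: a plain H2D copy.
via_cuda = cpu_img.cuda(device_id)

# The path that fails in the full repro after many iterations: scatter to a
# single GPU and take the only chunk, which is what DistributedDataParallel
# does with its inputs.
via_scatter = scatter(cpu_img, [device_id])[0]

# Both results should be bit-identical to the CPU source.
assert torch.equal(via_cuda.cpu(), cpu_img)
assert torch.equal(via_scatter.cpu(), cpu_img)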

If I run this command on 2 GPUs after applying the diff:

python -m torch.distributed.launch --nproc_per_node=2 tools/train_net.py --config-file configs/e2e_mask_rcnn_R_50_FPN_1x.yaml SOLVER.IMS_PER_BATCH 4 SOLVER.BASE_LR 0.0001

I observed two types of failures after dozens of iterations:

  1. The `assert diff < 0.01` check fails, meaning the data differs after the copy (a small diagnostic helper is sketched after this list):

  File "/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 73, in do_train
    assert diff < 0.01, diff
AssertionError: tensor(139.0213)

  2. The loss becomes NaN and the added check raises FloatingPointError, which does not happen if I use .cuda() to do the copy.
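When the assert fires, a small helper along these lines (purely illustrative, not part of the diff above) can show how much of the tensor was corrupted by the copy:

import torch

def report_copy_mismatch(orig_im, gpu_im, tol=0.01):
    # Compare the CPU source against the round-tripped GPU copy element-wise.
    diff = (orig_im - gpu_im.cpu()).abs()
    bad = diff > tol
    if bad.any():
        first_bad = bad.nonzero()[0].tolist()
        print(f"{bad.sum().item()} / {diff.numel()} elements differ; "
              f"max diff {diff.max().item():.4f}; first bad index {first_bad}")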

It seems to be a bug in PyTorch's scatter, but I found no clues there. I'm also unable to simplify the repro (simplifying it makes the bug disappear), so I'm posting it here.

Environment

PyTorch version: 1.0.0
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 18.04.1 LTS
GCC version: (GCC) 5.3.0
CMake version: version 3.12.2

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 9.0.176
GPU models and configuration:
GPU 0: Quadro GP100
GPU 1: Quadro GP100

Nvidia driver version: 410.79
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] msgpack-numpy==0.4.3.1
[pip3] numpy==1.16.1
[pip3] numpydoc==0.7.0
[pip3] torch==1.0.0
[pip3] torchvision==0.2.1
[conda] blas 1.0 mkl
[conda] mkl 2019.1 144
[conda] mkl-include 2019.1 144
[conda] mkl-service 1.1.2 py36he904b0f_5
[conda] mkl_fft 1.0.6 py36hd81dba3_0
[conda] mkl_random 1.0.2 py36hd81dba3_0
[conda] pytorch 1.0.0 py3.6_cuda9.0.176_cudnn7.4.1_1 pytorch
[conda] torchvision 0.2.1 py_2 pytorch

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 11 (3 by maintainers)

Top GitHub Comments

2 reactions
mrshenli commented, Apr 1, 2019

Fixed by pytorch/pytorch/pull/18465

1 reaction
ppwwyyxx commented, Mar 19, 2019

The instructions to reproduce are posted in the original issue. I cannot reproduce it with a simplified example, but it can be reliably reproduced with the original instructions.
