
Work with DistributedDataParallel


🐞 Bug

This is likely a feature request: on a 6-GPU node, group the first 3 GPUs as one pipeline and the remaining 3 as another, with the two model replicas synchronized via nn.parallel.DistributedDataParallel.

Code that reproduces the issue:

        from torch.nn.parallel import DistributedDataParallel as DDP
        from torchgpipe import GPipe

        model = GPipe(model, balance=[1, 1, 2], devices=devices, chunks=CSZ)
        model = DDP(model)

The full version can be found here: https://github.com/YHRen/gpipe_demo/blob/master/main.py
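For reference, a rough self-contained sketch of the setup being attempted (assuming torch.distributed.launch with two processes per node; build_model and CSZ are placeholders, not names from the real script):

        import argparse
        import torch.distributed as dist
        from torch.nn.parallel import DistributedDataParallel as DDP
        from torchgpipe import GPipe

        # torch.distributed.launch passes --local_rank; here 0 or 1, one process per 3-GPU replica
        parser = argparse.ArgumentParser()
        parser.add_argument('--local_rank', type=int, default=0)
        args, _ = parser.parse_known_args()

        dist.init_process_group(backend='nccl', init_method='env://')
        devices = [args.local_rank * 3 + i for i in range(3)]   # e.g. [0, 1, 2] or [3, 4, 5]

        model = build_model()                                   # placeholder for the real model
        model = GPipe(model, balance=[1, 1, 2], devices=devices, chunks=CSZ)
        model = DDP(model)                                      # the assert fires later, in loss.backward()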

Both ranks print the same traceback:

Traceback (most recent call last):
  File "main_dist.py", line 155, in <module>
    loss.backward()
  File "/sw/ascent/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.7.0-0/lib/python3.6/site-packages/torch/tensor.py", line 166, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/sw/ascent/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.7.0-0/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/sw/ascent/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.7.0-0/lib/python3.6/site-packages/torch/autograd/function.py", line 77, in apply
    return self._forward_cls.backward(self, *args)
  File "/ccsopen/home/yren/.local/lib/python3.6/site-packages/torchgpipe/checkpoint.py", line 269, in backward
    torch.autograd.backward(tensors, grad_output)
  File "/sw/ascent/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.7.0-0/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: has_marked_unused_parameters_ INTERNAL ASSERT FAILED at /opt/anaconda/conda-bld/pytorch-base_1594299597148/work/torch/csrc/distributed/c10d/reducer.cpp:290, please report a bug to PyTorch.
Traceback (most recent call last):
  File "/sw/ascent/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.7.0-0/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/sw/ascent/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.7.0-0/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/sw/ascent/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.7.0-0/lib/python3.6/site-packages/torch/distributed/launch.py", line 253, in <module>
    main()
  File "/sw/ascent/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.7.0-0/lib/python3.6/site-packages/torch/distributed/launch.py", line 249, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/sw/ascent/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.7.0-0/bin/python', '-u', 'main_dist.py', '--local_rank=1', '-m', 'cnn', '-b', '32', '-c', '4', '-d', '2048', '-w', '128', '-l', '5', '-e', '2', '--dist', '--gpus_per_group', '3', '--group_per_node', '2']' returned non-zero exit status 1.

If we wrap in the opposite order:

        model = model.cuda()  # the default CUDA device has already been set per rank
        model = DDP(model)
        model = GPipe(model, balance=[1, 1, 2], devices=devices, chunks=CSZ)

then DDP fails during construction, complaining that all tensors must be on devices[0]; presumably, when constructed without device_ids, it defaults to replicating the module across every visible GPU and requires all parameters to sit on the first one.

Traceback (most recent call last):
  File "main_dist.py", line 135, in <module>
    model = DDP(model)
  File "/sw/ascent/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.7.0-0/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 325, in __init__
    self._ddp_init_helper()
  File "/sw/ascent/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.7.0-0/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 343, in _ddp_init_helper
    self._module_copies = replicate(self.module, self.device_ids, detach=True)
  File "/sw/ascent/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.7.0-0/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 96, in replicate
    param_copies = _broadcast_coalesced_reshape(params, devices, detach)
  File "/sw/ascent/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.7.0-0/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 75, in _broadcast_coalesced_reshape
    return comm.broadcast_coalesced(tensors, devices)
  File "/sw/ascent/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.7.0-0/lib/python3.6/site-packages/torch/cuda/comm.py", line 39, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: all tensors must be on devices[0]

Environment

GPU: _CudaDeviceProperties(name='Tesla V100-SXM2-16GB', major=7, minor=0, total_memory=16128MB, multi_processor_count=80)
Number of GPUs: 6
CUDA: 10.2.89
cuDNN: 7605
Python: 3.6.10
PyTorch: 1.3.1
torchgpipe: 0.0.6

Additional context

I’m wondering if it is possible to run “DDP” for each split in the model pipeline.
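To make the question concrete, here is a rough sketch of what "DDP per split" might mean; this is not torchgpipe or DDP API, and allreduce_partition_grads is a hypothetical helper that all-reduces each partition's gradients after backward (GPipe stores its partitions in a ModuleList):

        import torch.distributed as dist

        # Hypothetical helper: after loss.backward(), all-reduce the gradients of each
        # pipeline partition separately, optionally in its own process group.
        def allreduce_partition_grads(gpipe_model, groups=None):
            for i, partition in enumerate(gpipe_model.partitions):
                group = None if groups is None else groups[i]
                for p in partition.parameters():
                    if p.grad is not None:
                        dist.all_reduce(p.grad, group=group)       # defaults to SUM
                        p.grad /= dist.get_world_size(group=group)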

Thank you.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 5

Top GitHub Comments

YHRen commented, Aug 25, 2020 (1 reaction)

Dear @sublee and @chiheonk

I have tested your suggested solution of using checkpoint='never'. I confirm it works without any errors.

My minimum demo is here: https://github.com/YHRen/gpipe_demo

I really like your work and thank you so much for your responses.

Please feel free to close the issue.

chiheonk commented, Aug 23, 2020 (0 reactions)

Sorry for the late response.

> I see. If I understand correctly, the recompute (checkpoint) for the previous chunk will be "destroyed" by the next chunk of the microbatch.

Yes. The computation graph of a partition is built and destroyed several times during backpropagation (due to checkpointing), so gradients accumulate multiple times on the corresponding parameters, which causes the error in DDP.

> This looks like a promising way. The effect would be maximized memory footprint, correct?

Yes.

> Could you instruct me how to set checkpoint='never'?

Simply change your code as follows:

        model = GPipe(model, balance=[1, 1, 2], devices=devices, chunks=CSZ, checkpoint='never')
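Put together with the original wrapping order, the combination reported to work above would look roughly like this (a sketch; model, devices and CSZ are the same placeholders as in the reproduction snippet):

        from torch.nn.parallel import DistributedDataParallel as DDP
        from torchgpipe import GPipe

        # checkpoint='never' avoids the rebuilt-and-destroyed checkpoint graphs that,
        # per the explanation above, caused DDP's reducer to see repeated gradient
        # accumulations and hit the internal assert.
        model = GPipe(model, balance=[1, 1, 2], devices=devices, chunks=CSZ,
                      checkpoint='never')
        model = DDP(model)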