OneBitAdam Incompatible with Pipeline Parallelism
So after a bit of work we finally got 1-bit Adam working over at https://github.com/EleutherAI/gpt-neox
But it seems not to be compatible with Pipeline Parallelism. My hypothesis is that the Waitall here https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/custom_collectives.py#L99 blocks until all ranks reach that point, but since the pipeline stages don’t all execute concurrently, some ranks never reach it and the call errors out.
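To make the failure mode concrete, here is a minimal mpi4py sketch of the kind of nonblocking gather-then-Waitall pattern involved. This is illustrative only (buffer names, shapes, and tags are placeholders, not the actual `gather_host` implementation):

```python
# Hedged sketch of a nonblocking gather followed by Waitall. If a pipeline
# stage never enters this exchange, the matching Isend is never posted, so the
# root's Waitall hangs or fails -- the hypothesized failure mode above.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

payload = np.full(4, rank, dtype=np.float32)  # placeholder data
requests = []

if rank == 0:
    # Root posts one nonblocking receive per non-root rank.
    recvbuf = np.empty((size, 4), dtype=np.float32)
    recvbuf[0] = payload
    for src in range(1, size):
        requests.append(comm.Irecv(recvbuf[src], source=src, tag=0))
else:
    # Every other rank posts a nonblocking send to root.
    requests.append(comm.Isend(payload, dest=0, tag=0))

# Blocks until every posted request completes; this implicitly assumes every
# rank in the communicator participates in the exchange.
MPI.Request.Waitall(requests)
```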
The actual error / stack trace I’m getting is:

```
    engine = PipelineEngine(args=args,
  File "/src/deepspeed/deepspeed/runtime/pipe/engine.py", line 52, in __init__
    super().__init__(*super_args, **super_kwargs)
  File "/src/deepspeed/deepspeed/runtime/engine.py", line 174, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/src/deepspeed/deepspeed/runtime/engine.py", line 572, in _configure_optimizer
    self.optimizer = self._configure_fp16_optimizer(basic_optimizer)
  File "/src/deepspeed/deepspeed/runtime/engine.py", line 628, in _configure_fp16_optimizer
    optimizer = FP16_Optimizer(
  File "/src/deepspeed/deepspeed/runtime/fp16/fused_optimizer.py", line 104, in __init__
    self.initialize_optimizer_states()
  File "/src/deepspeed/deepspeed/runtime/fp16/fused_optimizer.py", line 112, in initialize_optimizer_states
    self.optimizer.step()
  File "/src/deepspeed/deepspeed/runtime/fp16/onebit_adam.py", line 340, in step
    self.Compressed_Allreduce(exp_avg,
  File "/src/deepspeed/deepspeed/runtime/fp16/onebit_adam.py", line 157, in Compressed_Allreduce
    cupy_sign_list_packed, cupy_recvbuf_sign, cupy_worker_scale, cupy_recvbuf_scale = gather_host(rank,
  File "/src/deepspeed/deepspeed/runtime/custom_collectives.py", line 99, in gather_host
    MPI.Request.Waitall(requests)
  File "mpi4py/MPI/Request.pyx", line 124, in mpi4py.MPI.Request.Waitall
mpi4py.MPI.Exception: MPI_ERR_IN_STATUS: error code in status
```
To test my hypothesis I ran a pipeline model with a single stage (so no actual parallelism, but still using the PipelineModule / PipelineEngine classes), and this works fine. A rough sketch of that control experiment is below.
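Roughly, the single-stage setup looks like this. It is a hedged sketch: the layers are placeholders (the real model is the GPT-NeoX stack), and `args` is assumed to come from the usual argparse wiring with a `--deepspeed_config` flag:

```python
import argparse
import torch.nn as nn
import deepspeed
from deepspeed.pipe import PipelineModule

# Assumed argparse setup; the real script wires in many more flags.
parser = argparse.ArgumentParser()
parser = deepspeed.add_config_arguments(parser)
parser.add_argument("--local_rank", type=int, default=-1)
args = parser.parse_args()

# PipelineModule needs torch.distributed initialized before construction.
deepspeed.init_distributed()

# Placeholder layers standing in for the real transformer stack.
layers = [nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512)]

# num_stages=1: exercises the pipeline code path with no actual parallelism.
model = PipelineModule(layers=layers, num_stages=1)

# Optimizer states are initialized inside initialize(), which is where the
# multi-stage run crashes; the single-stage run gets through this fine.
engine, _, _, _ = deepspeed.initialize(args=args,
                                       model=model,
                                       model_parameters=model.parameters())
```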
I wonder if this is something you’ve encountered, or potentially something that’s fixed by https://github.com/microsoft/DeepSpeed/pull/817?
Top GitHub Comments
Sure @conglongli, happy to close this.
I’ll make sure to try with larger models at some point in the future and report back when I do.
Thanks @sdtblck, it looks good to me, so I have merged it into the 1-bit LAMB PR. On our side we will add a unit test and apply the same change to the 1-bit LAMB optimizer. For the MPI implementation, we might leave it as it is (and document the limitation), because the NCCL implementation has superior usability and performance, so we really don’t recommend using the MPI implementation (and we currently don’t have enough bandwidth to maintain it).
Another thing: do you mind if I close this issue after merging everything into master? There still remains an open-ended question about “what kind of models can benefit from 1-bit Adam / communication compression in general”. We can reopen this issue or create a new one if any of us has new findings about it 😃
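For anyone landing here later: the NCCL-backed implementation recommended above is selected via the optimizer’s `comm_backend_name` field in the DeepSpeed JSON config. A hedged sketch (parameter values are illustrative placeholders; 1-bit Adam also requires fp16 to be enabled):

```json
{
  "optimizer": {
    "type": "OneBitAdam",
    "params": {
      "lr": 0.0001,
      "freeze_step": 1000,
      "cuda_aware": false,
      "comm_backend_name": "nccl"
    }
  },
  "fp16": {
    "enabled": true
  }
}
```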