OneBitAdam Incompatible with Pipeline Parallelism
So after a bit of work we finally got 1-bit Adam working over at https://github.com/EleutherAI/gpt-neox
But it seems not to be compatible with Pipeline Parallelism. My hypothesis is that the Waitall here https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/custom_collectives.py#L99 blocks until all ranks reach that point, but since the pipeline stages don’t all execute concurrently, some ranks never reach it and the call errors out.
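To make the failure mode concrete, here is a minimal mpi4py sketch of the kind of nonblocking gather-then-Waitall pattern involved. This is illustrative only (buffer names, shapes, and tags are placeholders, not the actual `gather_host` implementation):

```python
# Hedged sketch of a nonblocking gather followed by Waitall. If a pipeline
# stage never enters this exchange, the matching Isend is never posted, so the
# root's Waitall hangs or fails -- the hypothesized failure mode above.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

payload = np.full(4, rank, dtype=np.float32)  # placeholder data
requests = []

if rank == 0:
    # Root posts one nonblocking receive per non-root rank.
    recvbuf = np.empty((size, 4), dtype=np.float32)
    recvbuf[0] = payload
    for src in range(1, size):
        requests.append(comm.Irecv(recvbuf[src], source=src, tag=0))
else:
    # Every other rank posts a nonblocking send to root.
    requests.append(comm.Isend(payload, dest=0, tag=0))

# Blocks until every posted request completes; this implicitly assumes every
# rank in the communicator participates in the exchange.
MPI.Request.Waitall(requests)
```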
The actual error / stack trace I’m getting is:

```
    engine = PipelineEngine(args=args,
  File "/src/deepspeed/deepspeed/runtime/pipe/engine.py", line 52, in __init__
    super().__init__(*super_args, **super_kwargs)
  File "/src/deepspeed/deepspeed/runtime/engine.py", line 174, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/src/deepspeed/deepspeed/runtime/engine.py", line 572, in _configure_optimizer
    self.optimizer = self._configure_fp16_optimizer(basic_optimizer)
  File "/src/deepspeed/deepspeed/runtime/engine.py", line 628, in _configure_fp16_optimizer
    optimizer = FP16_Optimizer(
  File "/src/deepspeed/deepspeed/runtime/fp16/fused_optimizer.py", line 104, in __init__
    self.initialize_optimizer_states()
  File "/src/deepspeed/deepspeed/runtime/fp16/fused_optimizer.py", line 112, in initialize_optimizer_states
    self.optimizer.step()
  File "/src/deepspeed/deepspeed/runtime/fp16/onebit_adam.py", line 340, in step
    self.Compressed_Allreduce(exp_avg,
  File "/src/deepspeed/deepspeed/runtime/fp16/onebit_adam.py", line 157, in Compressed_Allreduce
    cupy_sign_list_packed, cupy_recvbuf_sign, cupy_worker_scale, cupy_recvbuf_scale = gather_host(rank,
  File "/src/deepspeed/deepspeed/runtime/custom_collectives.py", line 99, in gather_host
    MPI.Request.Waitall(requests)
  File "mpi4py/MPI/Request.pyx", line 124, in mpi4py.MPI.Request.Waitall
mpi4py.MPI.Exception: MPI_ERR_IN_STATUS: error code in status
```
To test my hypothesis I ran a pipeline model with a single stage (so no actual parallelism, but still using the PipelineModule / PipelineEngine classes), and this works fine. A rough sketch of that control experiment is below.
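Roughly, the single-stage setup looks like this. It is a hedged sketch: the layers are placeholders (the real model is the GPT-NeoX stack), and `args` is assumed to come from the usual argparse wiring with a `--deepspeed_config` flag:

```python
import argparse
import torch.nn as nn
import deepspeed
from deepspeed.pipe import PipelineModule

# Assumed argparse setup; the real script wires in many more flags.
parser = argparse.ArgumentParser()
parser = deepspeed.add_config_arguments(parser)
parser.add_argument("--local_rank", type=int, default=-1)
args = parser.parse_args()

# PipelineModule needs torch.distributed initialized before construction.
deepspeed.init_distributed()

# Placeholder layers standing in for the real transformer stack.
layers = [nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512)]

# num_stages=1: exercises the pipeline code path with no actual parallelism.
model = PipelineModule(layers=layers, num_stages=1)

# Optimizer states are initialized inside initialize(), which is where the
# multi-stage run crashes; the single-stage run gets through this fine.
engine, _, _, _ = deepspeed.initialize(args=args,
                                       model=model,
                                       model_parameters=model.parameters())
```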
I wonder if this is something you’ve encountered, or potentially something that’s fixed by https://github.com/microsoft/DeepSpeed/pull/817?
Top GitHub Comments
Sure @conglongli, happy to close this.
I’ll make sure to try with larger models at some point in the future and report back when I do.
Thanks @sdtblck, it looks good to me, so I have merged it into the 1-bit LAMB PR. On our side we will add a unit test and apply the same change to the 1-bit LAMB optimizer. For the MPI implementation, we might leave it as it is (and document the limitation), because the NCCL implementation has superior usability and performance, so we really don’t recommend using the MPI implementation (and we currently don’t have enough bandwidth to maintain it).
Another thing: do you mind if I close this issue after merging everything into master? There still remains an open-ended question about “what kind of models can benefit from 1-bit Adam / communication compression in general”. We can reopen this issue or create a new one if any of us has new findings about it 😃
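For anyone landing here later: the NCCL-backed implementation recommended above is selected via the optimizer’s `comm_backend_name` field in the DeepSpeed JSON config. A hedged sketch (parameter values are illustrative placeholders; 1-bit Adam also requires fp16 to be enabled):

```json
{
  "optimizer": {
    "type": "OneBitAdam",
    "params": {
      "lr": 0.0001,
      "freeze_step": 1000,
      "cuda_aware": false,
      "comm_backend_name": "nccl"
    }
  },
  "fp16": {
    "enabled": true
  }
}
```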