[BUG] Compressed Adam optimizers - RuntimeError: Bool type is not supported by dlpack
Describe the bug
Traceback:
File "deepspeed/__init__.py", line 119, in initialize
engine = DeepSpeedEngine(args=args,
File "deepspeed/runtime/engine.py", line 293, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "deepspeed/runtime/engine.py", line 1106, in _configure_optimizer
self.optimizer = self._configure_fp16_optimizer(basic_optimizer)
File "deepspeed/runtime/engine.py", line 1243, in _configure_fp16_optimizer
optimizer = FP16_Optimizer(
File "deepspeed/runtime/fp16/fused_optimizer.py", line 111, in __init__
self.initialize_optimizer_states()
File "deepspeed/runtime/fp16/fused_optimizer.py", line 119, in initialize_optimizer_states
self.optimizer.step()
File "torch/optim/optimizer.py", line 88, in wrapper
return func(*args, **kwargs)
File "deepspeed/runtime/fp16/onebit/zoadam.py", line 239, in step
self.comm_backend_handle.compressed_allreduce(
File "deepspeed/runtime/comm/nccl.py", line 72, in compressed_allreduce
self.compression_backend.torch2cupy(buffer_m.sign_().add_(1).bool()),
File "deepspeed/runtime/compression/cupy.py", line 15, in torch2cupy
return cupy.fromDlpack(to_dlpack(tensor))
RuntimeError: Bool type is not supported by dlpack
When using the OneBitAdam or ZeroOneAdam implementation, the error from the title appears when comm_backend_name is set to nccl.
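For reference, a minimal sketch of an optimizer config that exercises this code path (the values below are illustrative placeholders, not my actual config; only the optimizer type and comm_backend_name matter here):

```python
# Hypothetical DeepSpeed config fragment; lr/freeze_step values are placeholders.
ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    "optimizer": {
        "type": "OneBitAdam",  # same error path with "ZeroOneAdam"
        "params": {
            "lr": 1e-4,
            "freeze_step": 1000,
            "cuda_aware": False,
            "comm_backend_name": "nccl",  # triggers the compressed_allreduce path above
        },
    },
}
```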
This is linked to these operations:
https://github.com/microsoft/DeepSpeed/blob/208d45bbf7cbde2abfb233e7d10803553fbcf126/deepspeed/runtime/comm/nccl.py#L72
https://github.com/microsoft/DeepSpeed/blob/208d45bbf7cbde2abfb233e7d10803553fbcf126/deepspeed/runtime/comm/nccl.py#L129
and to the fact that Bool is not supported by DLPack since PyTorch 1.10 (see https://github.com/pytorch/pytorch/issues/67081). Google's JAX repo recommends casting to uint8 instead of bool: https://github.com/google/jax/issues/4719. Beware that when I tried to implement the casting locally, I got terrible performance with ZeroOneAdam.
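For illustration, a minimal sketch of that uint8 cast applied to the torch2cupy helper (a local-patch sketch that assumes CuPy and a CUDA device are available, not the upstream fix):

```python
import torch
import cupy
from torch.utils.dlpack import to_dlpack


def torch2cupy(tensor):
    # DLPack has no bool dtype, which is why PyTorch >= 1.10 raises
    # "Bool type is not supported by dlpack" here. Casting to uint8
    # keeps the 0/1 values and satisfies DLPack.
    if tensor.dtype == torch.bool:
        tensor = tensor.to(torch.uint8)
    return cupy.fromDlpack(to_dlpack(tensor))


# Mirrors the call site at nccl.py line 72: sign_().add_(1).bool()
# turns the error buffer into 0/1 sign flags before the allreduce.
buffer_m = torch.randn(8, device="cuda")
sign_flags = torch2cupy(buffer_m.sign_().add_(1).bool())
```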
Expected behavior
The ZeroOneAdam optimizer working with nccl and the latest PyTorch version.
ds_report output
DeepSpeed general environment info:
torch version … 1.11.0+cu115
torch cuda version … 11.5
torch hip version … None
nvcc version … 11.4
deepspeed info … 0.6.1+208d45b, 208d45b, master
deepspeed wheel compiled w. … torch 1.11, cuda 11.5, hip 0.0
Launcher context
PyTorch Lightning DeepSpeedPlugin, Python 3.8
Top GitHub Comments
Hi @jhoareau, I created a PR based on your suggestion, https://github.com/microsoft/DeepSpeed/pull/1894, and I verified that torch 1.10 with this fix provides the same performance benefit and the same training loss curve on BERT pretraining as torch 1.8 without this fix. Could you try whether this PR fixes your issue?
I see. Yes, we will investigate this on our side. But please understand that because we need to test both performance and convergence, and because of bandwidth limitations, this will take some time. In the meantime, I would recommend using an older PyTorch if possible.