
[BUG] Compressed Adam optimizers - RuntimeError: Bool type is not supported by dlpack

See original GitHub issue

Describe the bug

Traceback:

  File "deepspeed/__init__.py", line 119, in initialize
    engine = DeepSpeedEngine(args=args,
  File "deepspeed/runtime/engine.py", line 293, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "deepspeed/runtime/engine.py", line 1106, in _configure_optimizer
    self.optimizer = self._configure_fp16_optimizer(basic_optimizer)
  File "deepspeed/runtime/engine.py", line 1243, in _configure_fp16_optimizer
    optimizer = FP16_Optimizer(
  File "deepspeed/runtime/fp16/fused_optimizer.py", line 111, in __init__
    self.initialize_optimizer_states()
  File "deepspeed/runtime/fp16/fused_optimizer.py", line 119, in initialize_optimizer_states
    self.optimizer.step()
  File "torch/optim/optimizer.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "deepspeed/runtime/fp16/onebit/zoadam.py", line 239, in step
    self.comm_backend_handle.compressed_allreduce(
  File "deepspeed/runtime/comm/nccl.py", line 72, in compressed_allreduce
    self.compression_backend.torch2cupy(buffer_m.sign_().add_(1).bool()),
  File "deepspeed/runtime/compression/cupy.py", line 15, in torch2cupy
    return cupy.fromDlpack(to_dlpack(tensor))
RuntimeError: Bool type is not supported by dlpack

When using the OneBitAdam or ZeroOneAdam implementations with comm_backend_name set to nccl, the error in the title appears.

This is linked to these operations:

  • https://github.com/microsoft/DeepSpeed/blob/208d45bbf7cbde2abfb233e7d10803553fbcf126/deepspeed/runtime/comm/nccl.py#L72
  • https://github.com/microsoft/DeepSpeed/blob/208d45bbf7cbde2abfb233e7d10803553fbcf126/deepspeed/runtime/comm/nccl.py#L129

It is also linked to the fact that Bool is not supported by dlpack since PyTorch 1.10: see https://github.com/pytorch/pytorch/issues/67081. Google’s JAX repo recommends casting to uint8 instead of bool: https://github.com/google/jax/issues/4719. Beware that when I tried to implement the cast locally, I got terrible performance with ZeroOneAdam.
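The uint8 workaround from the JAX thread maps onto DeepSpeed's 1-bit step like this (a NumPy sketch of the idea only, not the actual torch/CuPy code; the variable names are illustrative):

```python
import numpy as np

# Illustrative stand-in for buffer_m in deepspeed/runtime/comm/nccl.py.
buffer_m = np.array([-0.7, 0.0, 1.3, -2.1], dtype=np.float32)

# DeepSpeed's step: sign() -> {-1, 0, 1}, add 1 -> {0, 1, 2}, bool() ->
# {False, True}.  A bool tensor cannot cross DLPack on torch >= 1.10,
# so export as uint8 and re-binarize on the receiving side instead.
sign_uint8 = (np.sign(buffer_m) + 1).astype(np.uint8)   # [0, 1, 2, 0]
sign_bits = sign_uint8.astype(bool)                     # [False, True, True, False]
```

The round trip is lossless because every nonzero uint8 value re-binarizes to True, exactly as bool() would have produced; the cost is moving one byte per element instead of one bit, which may explain part of the performance hit reported above.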

Expected behavior

The ZeroOneAdam optimizer should work with nccl and the latest PyTorch version.

ds_report output

DeepSpeed general environment info:

  • torch version: 1.11.0+cu115
  • torch cuda version: 11.5
  • torch hip version: None
  • nvcc version: 11.4
  • deepspeed info: 0.6.1+208d45b, 208d45b, master
  • deepspeed wheel compiled w.: torch 1.11, cuda 11.5, hip 0.0

Launcher context

PyTorch-Lightning DeepSpeedPlugin, Python 3.8

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
conglongli commented, Apr 15, 2022

Hi @jhoareau, I created a PR based on your suggestion, https://github.com/microsoft/DeepSpeed/pull/1894, and I did test that torch 1.10 with this fix provides the same performance benefit and the same training loss curve on BERT pretraining as torch 1.8 without the fix. Could you check whether this PR fixes your issue?

1 reaction
conglongli commented, Apr 4, 2022

I see. Yes, we will investigate this on our side. But please understand that because we need to test both performance and convergence, and because of bandwidth limitations, this will take some time. Until then, I would recommend using an older PyTorch if possible.

Read more comments on GitHub >

Top Results From Across the Web

How to fix RuntimeError: Bool type is not supported by dlpack
It seems that an error is from torch update to 1.10.0. Reinstalling torch to 1.9.1 works for me. You can reinstall torch in...
tf.keras.optimizers.Optimizer | TensorFlow v2.11.0
This class supports distributed training. If you want to implement your own optimizer, please subclass this class instead of _BaseOptimizer.
Release 2.5.0
Introduces experimental support for Keras Preprocessing Layers API ( tf.keras.layers.experimental.preprocessing.* ) to handle data preprocessing operations, ...
PyTorch 1.11.0 Now Available - Exxact Corporation
... is deprecated # and will throw a runtime error in a future release. ... torch.from_dlpack operation for improved DLPack support (#60627) ...
Optimizers — DeepSpeed 0.8.0 documentation - Read the Docs
DeepSpeed offers high-performance implementations of Adam optimizer on CPU; FusedAdam ... amsgrad (boolean, optional) – NOT SUPPORTED in FusedLamb!
