
DeepSpeed v0.6.1 issues with PyTorch Lightning DeepSpeed Stage 3 Offloading


Describe the bug
DeepSpeed Stage 3 with offloading throws an error that is not thrown in DeepSpeed 0.5.8 (see the Screenshots section). I am currently using PyTorch 1.8.2, but the error also occurs with PyTorch 1.11.0.

To Reproduce
I am using a BERT model from transformers v4.17 with a sequence-classification tuning head. I have wrapped this model in a PyTorch Lightning module and am training it with the deepspeed_stage_3_offload strategy.
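
For reference, a minimal sketch of what such a Lightning wrapper might look like; the class name, checkpoint, and optimizer choice below are illustrative assumptions, not taken from the report:

from pytorch_lightning import LightningModule
from transformers import AutoModelForSequenceClassification


class BertClassifier(LightningModule):
    """Illustrative wrapper: BERT backbone plus a sequence-classification head."""

    def __init__(self, model_name="bert-base-uncased", num_labels=2, lr=2e-5):
        super().__init__()
        self.save_hyperparameters()
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_name, num_labels=num_labels
        )

    def training_step(self, batch, batch_idx):
        # batch is expected to contain input_ids, attention_mask, and labels
        outputs = self.model(**batch)
        self.log("train_loss", outputs.loss)
        return outputs.loss

    def configure_optimizers(self):
        # DeepSpeedCPUAdam pairs naturally with ZeRO offloading;
        # a plain torch.optim.AdamW would also work here.
        from deepspeed.ops.adam import DeepSpeedCPUAdam
        return DeepSpeedCPUAdam(self.parameters(), lr=self.hparams.lr)

The training run itself is launched as follows: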

from pytorch_lightning import Trainer

# get_model(), get_data_module(), and logger are user-defined helpers (not shown):
# get_model() returns the LightningModule wrapping the BERT classifier and
# get_data_module() returns the LightningDataModule providing the training data.
model = get_model()
dm = get_data_module()

trainer = Trainer(
    max_epochs=10,
    accelerator="gpu",
    devices=3,
    strategy="deepspeed_stage_3_offload",
    auto_lr_find=True,
    logger=logger,
)
trainer.fit(model, dm)
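
For context (not part of the reproduction): the "deepspeed_stage_3_offload" shorthand corresponds roughly to configuring the DeepSpeed strategy explicitly, i.e. ZeRO Stage 3 with optimizer states and parameters offloaded to CPU. A sketch of the equivalent explicit configuration:

from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DeepSpeedStrategy

# Roughly what the "deepspeed_stage_3_offload" alias registers in Lightning 1.6:
# ZeRO stage 3 with optimizer-state and parameter offload to CPU.
trainer = Trainer(
    max_epochs=10,
    accelerator="gpu",
    devices=3,
    strategy=DeepSpeedStrategy(
        stage=3,
        offload_optimizer=True,
        offload_parameters=True,
    ),
)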

Expected behavior
The model trains.

ds_report output

❯ ds_report
[2022-04-02 23:14:34,784] [WARNING] [partition_parameters.py:54:<module>] unable to find torch.distributed._all_gather_base. will fall back to torch.distributed.all_gather which will result in suboptimal performance. please consider upgrading your pytorch installation.
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING]  using untested triton version (1.1.1), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/keshav/mambaforge/envs/old_pytorch/lib/python3.7/site-packages/torch']
torch version .................... 1.8.2+cu111
torch cuda version ............... 11.1
torch hip version ................ None
nvcc version ..................... 11.1
deepspeed install path ........... ['/home/keshav/mambaforge/envs/old_pytorch/lib/python3.7/site-packages/deepspeed']
deepspeed info ................... 0.6.1, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.8, cuda 11.1, hip 0.0

Screenshots

Traceback (most recent call last):
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/mnt/storage/grid/home/keshav/projects/Ibis/experiments/dependency_tests/training_with_deepspeed_stage_3_offload.py", line 48, in <module>
    train()
  File "/mnt/storage/grid/home/keshav/projects/Ibis/experiments/dependency_tests/training_with_deepspeed_stage_3_offload.py", line 39, in train
    trainer.fit(model, dm)
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 771, in fit
    self._call_and_handle_interrupt(
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 722, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 812, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1237, in _run
    results = self._run_stage()
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1324, in _run_stage
    return self._run_train()
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1354, in _run_train
    self.fit_loop.run()
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 269, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 208, in advance
    batch_output = self.batch_loop.run(batch, batch_idx)
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
    outputs = self.optimizer_loop.run(split_batch, optimizers, batch_idx)
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 203, in advance
    result = self._run_optimization(
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 256, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 369, in _optimizer_step
    self.trainer._call_lightning_module_hook(
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1596, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py", line 1625, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 168, in step
    step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 278, in optimizer_step
    optimizer_output = super().optimizer_step(optimizer, opt_idx, closure, model, **kwargs)
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 193, in optimizer_step
    return self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/deepspeed.py", line 70, in optimizer_step
    closure_result = closure()
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 148, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 143, in closure
    self._backward_fn(step_output.closure_loss)
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 311, in backward_fn
    self.trainer._call_strategy_hook("backward", loss, optimizer, opt_idx)
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1766, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 168, in backward
    self.precision_plugin.backward(self.lightning_module, closure_loss, *args, **kwargs)
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/deepspeed.py", line 51, in backward
    deepspeed_engine.backward(closure_loss, *args, **kwargs)
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1667, in backward
    self.optimizer.backward(loss)
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 2793, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 53, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/torch/_tensor.py", line 363, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/torch/autograd/function.py", line 253, in apply
    return user_fn(self, *args)
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 510, in backward
    ctx.pre_backward_function(ctx.module)
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1461, in _run_before_backward_function
    self.pre_sub_module_backward_function(sub_module)
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1549, in pre_sub_module_backward_function
    self.param_coordinator.fetch_sub_module(sub_module)
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 358, in fetch_sub_module
    raise RuntimeError(
RuntimeError: tracing error at step 1429: expected the next 2 parameters in the parameter fetch queue to be ({'id': 487, 'status': 'AVAILABLE', 'numel': 1455104, 'ds_numel': 1455104, 'shape': (1421, 1024), 'ds_shape': (1421, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': {557}}, {'id': 488, 'status': 'AVAILABLE', 'numel': 1421, 'ds_numel': 1421, 'shape': (1421,), 'ds_shape': (1421,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {557}}) but got ().

System info (please complete the following information):

  • OS: Ubuntu 20.04
  • GPUs: 8 TU104GL [Quadro RTX 5000]
  • Single Machine
  • Python version: 3.7
  • Huggingface’s Transformers 4.17
  • DeepSpeed 0.6.1
  • Pytorch-Lightning 1.6.0
  • PyTorch 1.8.2

Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else?

I am launching with PyTorch Lightning; training is wrapped in a function that is executed from a Python script.

Docker context
Are you using a specific Docker image that you can share? n/a

Additional context
n/a

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 2
  • Comments: 16 (9 by maintainers)

Top GitHub Comments

jmwoloso commented, Apr 20, 2022 (2 reactions)

Yeah, I tried both building the wheel and installing with pip install ., and I still end up with the same error, but that could be something weird with my environment, so if @SeanNaren and @keshavd are able to get it to work, we’re probably good to close out. Any idea when 0.6.2 will be coming out, @tjruwase? And thank you for your help.

tjruwase commented, Apr 20, 2022 (2 reactions)

@keshavd, @jmwoloso, and @SeanNaren can you please test #1901? Thanks!


