DeepSpeed 0.6.1 issues with PyTorch Lightning DeepSpeed Stage 3 Offloading
Describe the bug
DeepSpeed Stage 3 with offloading is throwing an error that is not thrown under DeepSpeed 0.5.8 (see the Screenshots section). I am currently using PyTorch 1.8.2, but the error is also thrown with PyTorch 1.11.0.
To Reproduce
I am using a BERT model from transformers v4.17 with a sequence-classification head. I have wrapped this model in a PyTorch Lightning module and am training it with the deepspeed_stage_3_offload strategy.
from pytorch_lightning import Trainer

model = get_model()
dm = get_data_module()
trainer = Trainer(
    max_epochs=10,
    accelerator="gpu",
    devices=3,
    strategy="deepspeed_stage_3_offload",
    auto_lr_find=True,
    logger=logger,
)
trainer.fit(model, dm)
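For context, get_model(), get_data_module(), and logger above are placeholders from my setup. A minimal sketch of what the get_model() wrapper might look like (the class, hyperparameters, and optimizer below are illustrative assumptions, not my exact code):

import torch
from pytorch_lightning import LightningModule
from transformers import BertForSequenceClassification


class BertClassifier(LightningModule):
    # Wraps a transformers v4.17 BERT model with a sequence-classification head.
    def __init__(self, model_name="bert-base-uncased", num_labels=2):
        super().__init__()
        self.model = BertForSequenceClassification.from_pretrained(
            model_name, num_labels=num_labels
        )

    def training_step(self, batch, batch_idx):
        # transformers models return the loss directly when labels are in the batch
        outputs = self.model(**batch)
        self.log("train_loss", outputs.loss)
        return outputs.loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=2e-5)


def get_model():
    return BertClassifier()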
Expected behavior
The model trains.
ds_report output
❯ ds_report
[2022-04-02 23:14:34,784] [WARNING] [partition_parameters.py:54:<module>] unable to find torch.distributed._all_gather_base. will fall back to torch.distributed.all_gather which will result in suboptimal performance. please consider upgrading your pytorch installation.
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] using untested triton version (1.1.1), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/keshav/mambaforge/envs/old_pytorch/lib/python3.7/site-packages/torch']
torch version .................... 1.8.2+cu111
torch cuda version ............... 11.1
torch hip version ................ None
nvcc version ..................... 11.1
deepspeed install path ........... ['/home/keshav/mambaforge/envs/old_pytorch/lib/python3.7/site-packages/deepspeed']
deepspeed info ................... 0.6.1, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.8, cuda 11.1, hip 0.0
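For completeness, the same version information can also be confirmed from inside Python via the standard version attributes (nothing below is specific to this issue):

import torch
import deepspeed
import pytorch_lightning
import transformers

# Print the library versions and the CUDA build torch was compiled against.
print("torch:", torch.__version__, "| cuda:", torch.version.cuda)
print("deepspeed:", deepspeed.__version__)
print("pytorch_lightning:", pytorch_lightning.__version__)
print("transformers:", transformers.__version__)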
Screenshots
Traceback (most recent call last):
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/mnt/storage/grid/home/keshav/projects/Ibis/experiments/dependency_tests/training_with_deepspeed_stage_3_offload.py", line 48, in <module>
train()
File "/mnt/storage/grid/home/keshav/projects/Ibis/experiments/dependency_tests/training_with_deepspeed_stage_3_offload.py", line 39, in train
trainer.fit(model, dm)
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 771, in fit
self._call_and_handle_interrupt(
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 722, in _call_and_handle_interrupt
return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
return function(*args, **kwargs)
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 812, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1237, in _run
results = self._run_stage()
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1324, in _run_stage
return self._run_train()
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1354, in _run_train
self.fit_loop.run()
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 269, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 208, in advance
batch_output = self.batch_loop.run(batch, batch_idx)
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
outputs = self.optimizer_loop.run(split_batch, optimizers, batch_idx)
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 203, in advance
result = self._run_optimization(
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 256, in _run_optimization
self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 369, in _optimizer_step
self.trainer._call_lightning_module_hook(
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1596, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py", line 1625, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 168, in step
step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 278, in optimizer_step
optimizer_output = super().optimizer_step(optimizer, opt_idx, closure, model, **kwargs)
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 193, in optimizer_step
return self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/deepspeed.py", line 70, in optimizer_step
closure_result = closure()
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 148, in __call__
self._result = self.closure(*args, **kwargs)
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 143, in closure
self._backward_fn(step_output.closure_loss)
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 311, in backward_fn
self.trainer._call_strategy_hook("backward", loss, optimizer, opt_idx)
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1766, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 168, in backward
self.precision_plugin.backward(self.lightning_module, closure_loss, *args, **kwargs)
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/deepspeed.py", line 51, in backward
deepspeed_engine.backward(closure_loss, *args, **kwargs)
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1667, in backward
self.optimizer.backward(loss)
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 2793, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 53, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/torch/_tensor.py", line 363, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/torch/autograd/function.py", line 253, in apply
return user_fn(self, *args)
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 510, in backward
ctx.pre_backward_function(ctx.module)
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1461, in _run_before_backward_function
self.pre_sub_module_backward_function(sub_module)
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1549, in pre_sub_module_backward_function
self.param_coordinator.fetch_sub_module(sub_module)
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/keshav/mambaforge/envs/dl/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 358, in fetch_sub_module
raise RuntimeError(
RuntimeError: tracing error at step 1429: expected the next 2 parameters in the parameter fetch queue to be ({'id': 487, 'status': 'AVAILABLE', 'numel': 1455104, 'ds_numel': 1455104, 'shape': (1421, 1024), 'ds_shape': (1421, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': {557}}, {'id': 488, 'status': 'AVAILABLE', 'numel': 1421, 'ds_numel': 1421, 'shape': (1421,), 'ds_shape': (1421,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {557}}) but got ().
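For reference, the fetch queue in the error above belongs to ZeRO-3's parameter prefetching and coordination. Below is a sketch of how the same run could be expressed with an explicit DeepSpeed config (via Lightning's DeepSpeedStrategy) so those prefetch settings can be tuned; the keys are standard DeepSpeed ZeRO-3 options, but treating the lowered values as a mitigation for this specific tracing error is only an assumption, not something verified here:

from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DeepSpeedStrategy

# Explicit equivalent of the "deepspeed_stage_3_offload" alias, with the ZeRO-3
# prefetch knobs exposed. Whether lowering them avoids this particular tracing
# error is an assumption, not something confirmed in this issue.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
        "stage3_prefetch_bucket_size": 0,  # disable parameter prefetching
        "stage3_max_reuse_distance": 0,    # release parameters right after use
    },
}

trainer = Trainer(
    max_epochs=10,
    accelerator="gpu",
    devices=3,
    strategy=DeepSpeedStrategy(config=ds_config),
)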
System info (please complete the following information):
- OS: Ubuntu 20.03
- GPUs: 8 TU104GL [Quadro RTX 5000]
- Single Machine
- Python version: 3.7
- Huggingface’s Transformers 4.17
- DeepSpeed 0.6.1
- Pytorch-Lightning 1.6.0
- PyTorch 1.8.2
Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else?
I am launching with PyTorch Lightning. I have written a function that is executed via a plain Python script.
Docker context
Are you using a specific docker image that you can share? n/a
Additional context
n/a
Top GitHub Comments
Yeah, I tried both building the wheel and installing with pip install . and still end up with the same error, but that could be something weird with my environment, so if @SeanNaren and @keshavd are able to get it to work, we're probably good to close out. Any idea when 0.6.2 will be coming out, @tjruwase? And thank you for your help.

@keshavd, @jmwoloso, and @SeanNaren can you please test #1901? Thanks!