"IndexError: tuple index out of range" for the zero_stage=3
See original GitHub issue

I am trying to integrate DeepSpeed into this script and have successfully run it with ZeRO stage 2, but when I try ZeRO stage 3 this error appears just after the first epoch completes. I made the changes to the finetune_using_clm.py file suggested in the huggingface/accelerate repo and created a new file, tuned.py.
For ZeRO stage 3, the error points to:
Traceback (most recent call last): File "tuned.py", line 398, in main accelerator.backward(loss)
The whole error is:
Traceback (most recent call last):
File "tuned.py", line 398, in main
accelerator.backward(loss)
File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1310, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/deepspeed.py", line 156, in backward
self.engine.backward(loss)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1860, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage3.py", line 2070, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/fp16/loss_scaler.py", line 51, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/function.py", line 267, in apply
return user_fn(self, *args)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 144, in backward
ctx.pre_backward_function(ctx.module)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _run_before_backward_function
self.pre_sub_module_backward_function(sub_module)
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 487, in pre_sub_module_backward_function
param_coordinator.trace_prologue(sub_module)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 147, in trace_prologue
if sub_module != self.__submodule_order[self.__step_id]:
IndexError: tuple index out of range
I don't know why it gives this error, as everything runs fine with ZeRO stage 2.
Any help in this regard would be highly appreciated.
I am using Google Colab for the task.
Package versions: mpi4py 3.1.4, deepspeed 0.7.6, accelerate 0.15.0, transformers 4.25.1
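To make the failure mode concrete: ZeRO stage 3 records the order in which submodules execute during the first pass and later replays that recorded order by index, which is what trace_prologue is doing in the last frame of the traceback. The sketch below is a minimal standalone illustration of that pattern (the class and module names are invented for illustration, not DeepSpeed's real internals); it shows how a pass that takes one step more than was recorded runs the index off the end of the tuple.

```python
# Minimal illustration (assumed names, not DeepSpeed's actual API) of the
# trace-replay pattern behind the IndexError: the submodule execution order
# is recorded as a tuple on the first pass, then replayed by index later.

class TraceReplayCoordinator:
    def __init__(self):
        self.submodule_order = ()  # recorded order, stored as a tuple
        self.step_id = 0

    def record(self, modules):
        # First pass: remember the module execution order.
        self.submodule_order = tuple(modules)
        self.step_id = 0

    def trace_prologue(self, sub_module):
        # Same shape as the failing line in the traceback: compare the
        # incoming module against the recorded one at the current step id.
        if sub_module != self.submodule_order[self.step_id]:
            raise RuntimeError("module executed out of recorded order")
        self.step_id += 1

coord = TraceReplayCoordinator()
coord.record(["embeddings", "block_0", "lm_head"])

for module in ["embeddings", "block_0", "lm_head"]:
    coord.trace_prologue(module)  # replay matches the recorded trace

# One step beyond the recorded trace reproduces the crash:
try:
    coord.trace_prologue("extra_block")
except IndexError as e:
    caught = str(e)

print(caught)  # -> tuple index out of range
```

This is why the error can surface only at an epoch boundary: if the graph walked in a later pass differs from the one recorded in the first, the replay index no longer lines up with the recorded tuple.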
Issue Analytics
- State:
- Created 9 months ago
- Comments: 16
Hi, thank you so much, @pacman100! It is okay now. Thanks again for taking the time to look into the issue. It means a lot!
Hello @asifehmad, I made the changes I suggested above and got the following code, which works fine. In the conf, I set concatenate_raw: true. Accelerate version 0.0.15.dev, DeepSpeed version 0.7.7, PyTorch version 1.14.0.dev20221117+cu117, and transformers version 4.23.0.dev0.

Command I ran on 2 A100 GPUs:
Output logs:
Therefore, I am unable to reproduce the error. Hope this helps.
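For readers reproducing this setup, a minimal ZeRO stage 3 DeepSpeed config of the kind accelerate passes through to DeepSpeed might look like the following. The values are illustrative only; the actual config used in this thread is not shown above.

```json
{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "fp16": { "enabled": true },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
```

The "auto" values are placeholders that the Hugging Face integration fills in from the training arguments at launch time.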