
Cannot train M2M100 using run_translation.py and DeepSpeed ZeRO stage 3

Environment info

  • transformers version: 4.18.0
  • Platform: Linux
  • Python version: 3.8.12
  • PyTorch version (GPU?): 1.10
  • Tensorflow version (GPU?): -
  • DeepSpeed version: 0.6.1
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: DeepSpeed ZeRO stage 3

Who can help

@stas00

Information

The problem arises when:

  • I try to fine-tune the Hugging Face facebook/m2m100_418M model with DeepSpeed ZeRO stage 3, using the run_translation.py script from transformers/examples/pytorch/translation/. If I use t5-small instead of facebook/m2m100_418M, the model trains. Likewise, if I keep facebook/m2m100_418M but switch from ds_config_zero3.json to ds_config_zero2.json, the model trains again. (See the sketch below for a likely explanation.)
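
A plausible root cause, inferred from the tracing error below and from the name of the fix branch (ds-m2m-layerdrop): M2M100 uses LayerDrop, which randomly skips whole encoder/decoder layers during training, while the ZeRO stage 3 parameter coordinator records the order in which module parameters are fetched and expects that order to repeat every step. t5-small has no LayerDrop, and ZeRO stage 2 does not partition parameters, which would explain why both of those combinations train fine. A minimal sketch of the pattern (a paraphrase for illustration, not the exact modeling_m2m_100.py source):

import random
import torch
from torch import nn

def layerdrop_forward(layers: nn.ModuleList, hidden_states: torch.Tensor,
                      layerdrop: float, training: bool) -> torch.Tensor:
    """Paraphrase of the M2M100 encoder loop, simplified for illustration."""
    for layer in layers:
        # LayerDrop: during training, skip each layer with probability
        # `layerdrop` (the M2M100 config default is 0.05).
        if training and random.uniform(0, 1) < layerdrop:
            continue  # the skipped layer's parameters are never fetched,
                      # so ZeRO-3's recorded fetch queue no longer matches
        hidden_states = layer(hidden_states)
    return hidden_states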

To reproduce

deepspeed run_translation.py \
--deepspeed ds_config_zero3.json \
--model_name_or_path facebook/m2m100_418M \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--output_dir output_dir --overwrite_output_dir \
--fp16 \
--do_train --do_eval --do_predict \
--max_train_samples 500 --max_eval_samples 50 --max_predict_samples 50 \
--num_train_epochs 3 \
--dataset_name wmt16 --dataset_config "ro-en" \
--source_lang en --target_lang ro \
--predict_with_generate --forced_bos_token ro

where:

  • run_translation.py is the same file as in transformers/examples/pytorch/translation/run_translation.py
  • ds_config_zero3.json is the same file as in transformers/tests/deepspeed/ds_config_zero3.json
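
For reference, here is the rough shape of that ZeRO stage 3 config, expressed as a Python dict (the HF Trainer's deepspeed argument also accepts a dict instead of a file path). This is a from-memory sketch of a few key fields, not the exact file; the "auto" values are resolved by the Trainer integration at runtime:

# Approximate shape of tests/deepspeed/ds_config_zero3.json (sketch only,
# not the exact file contents).
ds_config = {
    "fp16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "gradient_accumulation_steps": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}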

Error:

Traceback (most recent call last):
  File "run_translation.py", line 636, in <module>
    main()
  File "run_translation.py", line 553, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1422, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 2011, in training_step
    loss = self.compute_loss(model, inputs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 2043, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1556, in forward
    loss = self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1120, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/m2m_100/modeling_m2m_100.py", line 1306, in forward
    outputs = self.model(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1120, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/m2m_100/modeling_m2m_100.py", line 1164, in forward
    encoder_outputs = self.encoder(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1120, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/m2m_100/modeling_m2m_100.py", line 819, in forward
    layer_outputs = encoder_layer(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1120, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/m2m_100/modeling_m2m_100.py", line 379, in forward
    hidden_states = self.self_attn_layer_norm(hidden_states)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1109, in _call_impl
    result = hook(self, input)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1411, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1528, in pre_sub_module_forward_function
    self.param_coordinator.fetch_sub_module(sub_module)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 358, in fetch_sub_module
    raise RuntimeError(
RuntimeError: tracing error at step 42: expected the next 2 parameters in the parameter fetch queue to be ({'id': 26, 'status': 'AVAILABLE', 'numel': 1024, 'ds_numel': 1024, 'shape': (1024,), 'ds_shape': (1024,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {24}}, {'id': 27, 'status': 'AVAILABLE', 'numel': 1024, 'ds_numel': 1024, 'shape': (1024,), 'ds_shape': (1024,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {24}}) but got ({'id': 115, 'status': 'NOT_AVAILABLE', 'numel': 0, 'ds_numel': 1024, 'shape': (0,), 'ds_shape': (1024,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': set()}, {'id': 116, 'status': 'NOT_AVAILABLE', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set()}).
  1%|█                                                                                                                                                                                             | 1/189 [00:01<04:33,  1.45s/it]
[2022-04-10 20:34:32,488] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 41615
[2022-04-10 20:34:32,488] [ERROR] [launch.py:184:sigkill_handler] ['/opt/conda/bin/python3.8', '-u', 'run_translation.py', '--local_rank=0', '--deepspeed', 'config/ds_config_zero3.json', '--model_name_or_path', 'facebook/m2m100_418M', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '8', '--output_dir', 'output_dir', '--overwrite_output_dir', '--fp16', '--do_train', '--do_eval', '--do_predict', '--max_train_samples', '500', '--max_eval_samples', '50', '--max_predict_samples', '50', '--num_train_epochs', '3', '--dataset_name', 'wmt16', '--dataset_config', 'ro-en', '--source_lang', 'en', '--target_lang', 'ro', '--predict_with_generate', '--forced_bos_token', 'ro'] exits with return code = 1

Expected behavior

The model trains.

Additional info

Downgrading deepspeed from 0.6.1 to 0.5.10 and transformers from 4.18.0 to 4.16.2 results in the following error instead:

Traceback (most recent call last):
  File "run_translation.py", line 636, in <module>
    main()
  File "run_translation.py", line 553, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1365, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1956, in training_step
    loss = self.deepspeed.backward(loss)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1697, in backward
    self.optimizer.backward(loss)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 2944, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 53, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/function.py", line 199, in apply
    return user_fn(self, *args)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 562, in backward
    ctx.pre_backward_function(ctx.module)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1456, in _run_before_backward_function
    self.pre_sub_module_backward_function(sub_module)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1551, in pre_sub_module_backward_function
    self.param_coordinator.prefetch_next_sub_modules(sub_module,
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 358, in prefetch_next_sub_modules
    params_to_prefetch = self.prefetch_coordinator.get_params_to_prefetch(
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 220, in get_params_to_prefetch
    if sub_module.id != self.sub_module_trace[self.step_id]:
IndexError: list index out of range
  1%|█                                                                                                                                                                                             | 1/189 [00:01<04:02,  1.29s/it]
[2022-04-10 20:44:02,482] [INFO] [launch.py:160:sigkill_handler] Killing subprocess 45884
[2022-04-10 20:44:02,482] [ERROR] [launch.py:166:sigkill_handler] ['/opt/conda/bin/python3.8', '-u', 'run_translation.py', '--local_rank=0', '--deepspeed', 'config/ds_config_zero3.json', '--model_name_or_path', 'facebook/m2m100_418M', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '8', '--output_dir', 'output_dir', '--overwrite_output_dir', '--fp16', '--do_train', '--do_eval', '--do_predict', '--max_train_samples', '500', '--max_eval_samples', '50', '--max_predict_samples', '50', '--num_train_epochs', '3', '--dataset_name', 'wmt16', '--dataset_config', 'ro-en', '--source_lang', 'en', '--target_lang', 'ro', '--predict_with_generate', '--forced_bos_token', 'ro'] exits with return code = 1
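
Both tracebacks point at the same mismatch between the traced and the actual module execution order; the older versions just fail later, inside the prefetch coordinator, with an IndexError instead of an explicit tracing error. Before the fix below landed, one hypothetical workaround (untested in this issue, and assuming the LayerDrop hypothesis is correct) would have been to disable LayerDrop so the execution order becomes deterministic:

# Hypothetical workaround, untested here: turn off LayerDrop before
# fine-tuning under ZeRO-3. encoder_layerdrop and decoder_layerdrop are
# real M2M100Config attributes.
from transformers import AutoConfig, AutoModelForSeq2SeqLM

config = AutoConfig.from_pretrained("facebook/m2m100_418M")
config.encoder_layerdrop = 0.0
config.decoder_layerdrop = 0.0
model = AutoModelForSeq2SeqLM.from_pretrained(
    "facebook/m2m100_418M", config=config
)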

Top GitHub Comments

1 reaction
evros-chris commented, Apr 14, 2022

Yes, you are right, it does work! Thanks a lot for fixing this @stas00!

0 reactions
stas00 commented, Apr 13, 2022

Update: nope, it works just fine. I had just suggested that you use PYTHONPATH without having tried it myself 😉

Try again with the following (PYTHONPATH=src makes the freshly checked-out sources take precedence over the pip-installed transformers):

git clone https://github.com/huggingface/transformers
cd transformers
git checkout ds-m2m-layerdrop
PYTHONPATH=src deepspeed examples/pytorch/translation/run_translation.py \
--deepspeed tests/deepspeed/ds_config_zero3.json \
--model_name_or_path facebook/m2m100_418M \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--output_dir output_dir --overwrite_output_dir \
--fp16 \
--do_train --do_eval --do_predict \
--max_train_samples 500 --max_eval_samples 50 --max_predict_samples 50 \
--num_train_epochs 3 \
--dataset_name wmt16 --dataset_config "ro-en" \
--source_lang en --target_lang ro \
--predict_with_generate --forced_bos_token ro