
Cannot train M2M100 using run_translation.py and DeepSpeed ZeRO stage 3

Environment info

  • transformers version: 4.18.0
  • Platform: Linux
  • Python version: 3.8.12
  • PyTorch version (GPU?): 1.10
  • Tensorflow version (GPU?): -
  • DeepSpeed version: 0.6.1
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: DeepSpeed ZeRO stage 3

Who can help

@stas00

Information

The problem arises when:

  • I try to fine-tune the Hugging Face facebook/m2m100_418M model with DeepSpeed ZeRO stage 3, using the run_translation.py script from transformers/examples/pytorch/translation/. If I use t5-small instead of facebook/m2m100_418M, the model trains. Likewise, if I keep facebook/m2m100_418M but switch from ds_config_zero3.json to ds_config_zero2.json, the model trains again. (See the sketch below for a likely explanation.)
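
A plausible root cause, inferred from the tracing error below and from the name of the fix branch (ds-m2m-layerdrop): M2M100 uses LayerDrop, which randomly skips whole encoder/decoder layers during training, while the ZeRO stage 3 parameter coordinator records the order in which module parameters are fetched and expects that order to repeat every step. t5-small has no LayerDrop, and ZeRO stage 2 does not partition parameters, which would explain why both of those combinations train fine. A minimal sketch of the pattern (a paraphrase for illustration, not the exact modeling_m2m_100.py source):

import random
import torch
from torch import nn

def layerdrop_forward(layers: nn.ModuleList, hidden_states: torch.Tensor,
                      layerdrop: float, training: bool) -> torch.Tensor:
    """Paraphrase of the M2M100 encoder loop, simplified for illustration."""
    for layer in layers:
        # LayerDrop: during training, skip each layer with probability
        # `layerdrop` (the M2M100 config default is 0.05).
        if training and random.uniform(0, 1) < layerdrop:
            continue  # the skipped layer's parameters are never fetched,
                      # so ZeRO-3's recorded fetch queue no longer matches
        hidden_states = layer(hidden_states)
    return hidden_states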

To reproduce

deepspeed run_translation.py \
--deepspeed ds_config_zero3.json \
--model_name_or_path facebook/m2m100_418M \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--output_dir output_dir --overwrite_output_dir \
--fp16 \
--do_train --do_eval --do_predict \
--max_train_samples 500 --max_eval_samples 50 --max_predict_samples 50 \
--num_train_epochs 3 \
--dataset_name wmt16 --dataset_config "ro-en" \
--source_lang en --target_lang ro \
--predict_with_generate --forced_bos_token ro

where:

  • run_translation.py is the same file as in transformers/examples/pytorch/translation/run_translation.py
  • ds_config_zero3.json is the same file as in transformers/tests/deepspeed/ds_config_zero3.json
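
For reference, here is the rough shape of that ZeRO stage 3 config, expressed as a Python dict (the HF Trainer's deepspeed argument also accepts a dict instead of a file path). This is a from-memory sketch of a few key fields, not the exact file; the "auto" values are resolved by the Trainer integration at runtime:

# Approximate shape of tests/deepspeed/ds_config_zero3.json (sketch only,
# not the exact file contents).
ds_config = {
    "fp16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "gradient_accumulation_steps": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}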

Error:

Traceback (most recent call last):
  File "run_translation.py", line 636, in <module>
    main()
  File "run_translation.py", line 553, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1422, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 2011, in training_step
    loss = self.compute_loss(model, inputs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 2043, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1556, in forward
    loss = self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1120, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/m2m_100/modeling_m2m_100.py", line 1306, in forward
    outputs = self.model(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1120, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/m2m_100/modeling_m2m_100.py", line 1164, in forward
    encoder_outputs = self.encoder(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1120, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/m2m_100/modeling_m2m_100.py", line 819, in forward
    layer_outputs = encoder_layer(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1120, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/m2m_100/modeling_m2m_100.py", line 379, in forward
    hidden_states = self.self_attn_layer_norm(hidden_states)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1109, in _call_impl
    result = hook(self, input)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1411, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1528, in pre_sub_module_forward_function
    self.param_coordinator.fetch_sub_module(sub_module)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 358, in fetch_sub_module
    raise RuntimeError(
RuntimeError: tracing error at step 42: expected the next 2 parameters in the parameter fetch queue to be ({'id': 26, 'status': 'AVAILABLE', 'numel': 1024, 'ds_numel': 1024, 'shape': (1024,), 'ds_shape': (1024,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {24}}, {'id': 27, 'status': 'AVAILABLE', 'numel': 1024, 'ds_numel': 1024, 'shape': (1024,), 'ds_shape': (1024,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {24}}) but got ({'id': 115, 'status': 'NOT_AVAILABLE', 'numel': 0, 'ds_numel': 1024, 'shape': (0,), 'ds_shape': (1024,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': set()}, {'id': 116, 'status': 'NOT_AVAILABLE', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set()}).
  1%|█                                                                                                                                                                                             | 1/189 [00:01<04:33,  1.45s/it]
[2022-04-10 20:34:32,488] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 41615
[2022-04-10 20:34:32,488] [ERROR] [launch.py:184:sigkill_handler] ['/opt/conda/bin/python3.8', '-u', 'run_translation.py', '--local_rank=0', '--deepspeed', 'config/ds_config_zero3.json', '--model_name_or_path', 'facebook/m2m100_418M', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '8', '--output_dir', 'output_dir', '--overwrite_output_dir', '--fp16', '--do_train', '--do_eval', '--do_predict', '--max_train_samples', '500', '--max_eval_samples', '50', '--max_predict_samples', '50', '--num_train_epochs', '3', '--dataset_name', 'wmt16', '--dataset_config', 'ro-en', '--source_lang', 'en', '--target_lang', 'ro', '--predict_with_generate', '--forced_bos_token', 'ro'] exits with return code = 1

Expected behavior

The model trains.

Additional info

Downgrading deepspeed from 0.6.1 to 0.5.10 and transformers from 4.18.0 to 4.16.2 results in the following error instead:

Traceback (most recent call last):
  File "run_translation.py", line 636, in <module>
    main()
  File "run_translation.py", line 553, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1365, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1956, in training_step
    loss = self.deepspeed.backward(loss)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1697, in backward
    self.optimizer.backward(loss)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 2944, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 53, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/function.py", line 199, in apply
    return user_fn(self, *args)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 562, in backward
    ctx.pre_backward_function(ctx.module)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1456, in _run_before_backward_function
    self.pre_sub_module_backward_function(sub_module)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1551, in pre_sub_module_backward_function
    self.param_coordinator.prefetch_next_sub_modules(sub_module,
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 358, in prefetch_next_sub_modules
    params_to_prefetch = self.prefetch_coordinator.get_params_to_prefetch(
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 220, in get_params_to_prefetch
    if sub_module.id != self.sub_module_trace[self.step_id]:
IndexError: list index out of range
  1%|█                                                                                                                                                                                             | 1/189 [00:01<04:02,  1.29s/it]
[2022-04-10 20:44:02,482] [INFO] [launch.py:160:sigkill_handler] Killing subprocess 45884
[2022-04-10 20:44:02,482] [ERROR] [launch.py:166:sigkill_handler] ['/opt/conda/bin/python3.8', '-u', 'run_translation.py', '--local_rank=0', '--deepspeed', 'config/ds_config_zero3.json', '--model_name_or_path', 'facebook/m2m100_418M', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '8', '--output_dir', 'output_dir', '--overwrite_output_dir', '--fp16', '--do_train', '--do_eval', '--do_predict', '--max_train_samples', '500', '--max_eval_samples', '50', '--max_predict_samples', '50', '--num_train_epochs', '3', '--dataset_name', 'wmt16', '--dataset_config', 'ro-en', '--source_lang', 'en', '--target_lang', 'ro', '--predict_with_generate', '--forced_bos_token', 'ro'] exits with return code = 1
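
Both tracebacks point at the same mismatch between the traced and the actual module execution order; the older versions just fail later, inside the prefetch coordinator, with an IndexError instead of an explicit tracing error. Before the fix below landed, one hypothetical workaround (untested in this issue, and assuming the LayerDrop hypothesis is correct) would have been to disable LayerDrop so the execution order becomes deterministic:

# Hypothetical workaround, untested here: turn off LayerDrop before
# fine-tuning under ZeRO-3. encoder_layerdrop and decoder_layerdrop are
# real M2M100Config attributes.
from transformers import AutoConfig, AutoModelForSeq2SeqLM

config = AutoConfig.from_pretrained("facebook/m2m100_418M")
config.encoder_layerdrop = 0.0
config.decoder_layerdrop = 0.0
model = AutoModelForSeq2SeqLM.from_pretrained(
    "facebook/m2m100_418M", config=config
)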

Top GitHub Comments

1 reaction
evros-chris commented, Apr 14, 2022

Yes, you are right, it does work! Thanks a lot for fixing this @stas00!

0 reactions
stas00 commented, Apr 13, 2022

Update: nope, it works just fine. I had just suggested that you use PYTHONPATH without having tried it myself 😉

Try again with the following (PYTHONPATH=src makes the freshly checked-out sources take precedence over the pip-installed transformers):

git clone https://github.com/huggingface/transformers
cd transformers
git checkout ds-m2m-layerdrop
PYTHONPATH=src deepspeed examples/pytorch/translation/run_translation.py \
--deepspeed tests/deepspeed/ds_config_zero3.json \
--model_name_or_path facebook/m2m100_418M \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--output_dir output_dir --overwrite_output_dir \
--fp16 \
--do_train --do_eval --do_predict \
--max_train_samples 500 --max_eval_samples 50 --max_predict_samples 50 \
--num_train_epochs 3 \
--dataset_name wmt16 --dataset_config "ro-en" \
--source_lang en --target_lang ro \
--predict_with_generate --forced_bos_token ro