Cannot train M2M100 using run_translation.py and DeepSpeed ZeRO stage 3
Environment info
- `transformers` version: 4.18.0
- Platform: Linux
- Python version: 3.8.12
- PyTorch version (GPU?): 1.10
- Tensorflow version (GPU?): -
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: DeepSpeed ZeRO stage 3
Library Versions:
- deepspeed 0.6.1
- transformers 4.18.0
- pytorch 1.10
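For reference, the environment above can be approximated with a pinned install along these lines (the exact PyTorch/CUDA build is an assumption and should be adapted to the machine at hand):

```bash
# Hypothetical commands to approximate the reported environment;
# pick the torch 1.10.x wheel matching your CUDA version.
pip install transformers==4.18.0 deepspeed==0.6.1
pip install torch==1.10.2
```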
Who can help
Information
The problem arises when I try to fine-tune the Hugging Face `facebook/m2m100_418M` model using the `run_translation.py` script from `transformers/examples/pytorch/translation/run_translation.py` together with DeepSpeed ZeRO stage 3. If I use `t5-small` instead of `facebook/m2m100_418M`, the model trains. Likewise, if I keep `facebook/m2m100_418M` but use `ds_config_zero2.json` instead of `ds_config_zero3.json`, the model trains again.
To reproduce
deepspeed run_translation.py \
--deepspeed ds_config_zero3.json \
--model_name_or_path facebook/m2m100_418M \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--output_dir output_dir --overwrite_output_dir \
--fp16 \
--do_train --do_eval --do_predict \
--max_train_samples 500 --max_eval_samples 50 --max_predict_samples 50 \
--num_train_epochs 3 \
--dataset_name wmt16 --dataset_config "ro-en" \
--source_lang en --target_lang ro \
--predict_with_generate --forced_bos_token ro
where:
- `run_translation.py` is the same file as `transformers/examples/pytorch/translation/run_translation.py`
- `ds_config_zero3.json` is the same file as `transformers/tests/deepspeed/ds_config_zero3.json` (a minimal sketch of a comparable ZeRO-3 config is shown below)
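For readers who do not have the repository checked out, here is a minimal, hand-written sketch in the spirit of `ds_config_zero3.json`. It is not a copy of the file in `transformers/tests/deepspeed/`, which carries additional tuning options; the `"auto"` values are placeholders resolved by the HF Trainer's DeepSpeed integration.

```json
{
  "fp16": { "enabled": "auto" },
  "optimizer": {
    "type": "AdamW",
    "params": { "lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto" }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": { "warmup_min_lr": "auto", "warmup_max_lr": "auto", "warmup_num_steps": "auto" }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto"
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```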
Error:
Traceback (most recent call last):
File "run_translation.py", line 636, in <module>
main()
File "run_translation.py", line 553, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1422, in train
tr_loss_step = self.training_step(model, inputs)
File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 2011, in training_step
loss = self.compute_loss(model, inputs)
File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 2043, in compute_loss
outputs = model(**inputs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1556, in forward
loss = self.module(*inputs, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1120, in _call_impl
result = forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/transformers/models/m2m_100/modeling_m2m_100.py", line 1306, in forward
outputs = self.model(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1120, in _call_impl
result = forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/transformers/models/m2m_100/modeling_m2m_100.py", line 1164, in forward
encoder_outputs = self.encoder(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1120, in _call_impl
result = forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/transformers/models/m2m_100/modeling_m2m_100.py", line 819, in forward
layer_outputs = encoder_layer(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1120, in _call_impl
result = forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/transformers/models/m2m_100/modeling_m2m_100.py", line 379, in forward
hidden_states = self.self_attn_layer_norm(hidden_states)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1109, in _call_impl
result = hook(self, input)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1411, in _pre_forward_module_hook
self.pre_sub_module_forward_function(module)
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1528, in pre_sub_module_forward_function
self.param_coordinator.fetch_sub_module(sub_module)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 358, in fetch_sub_module
raise RuntimeError(
RuntimeError: tracing error at step 42: expected the next 2 parameters in the parameter fetch queue to be ({'id': 26, 'status': 'AVAILABLE', 'numel': 1024, 'ds_numel': 1024, 'shape': (1024,), 'ds_shape': (1024,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {24}}, {'id': 27, 'status': 'AVAILABLE', 'numel': 1024, 'ds_numel': 1024, 'shape': (1024,), 'ds_shape': (1024,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {24}}) but got ({'id': 115, 'status': 'NOT_AVAILABLE', 'numel': 0, 'ds_numel': 1024, 'shape': (0,), 'ds_shape': (1024,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': set()}, {'id': 116, 'status': 'NOT_AVAILABLE', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set()}).
1%|█ | 1/189 [00:01<04:33, 1.45s/it]
[2022-04-10 20:34:32,488] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 41615
[2022-04-10 20:34:32,488] [ERROR] [launch.py:184:sigkill_handler] ['/opt/conda/bin/python3.8', '-u', 'run_translation.py', '--local_rank=0', '--deepspeed', 'config/ds_config_zero3.json', '--model_name_or_path', 'facebook/m2m100_418M', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '8', '--output_dir', 'output_dir', '--overwrite_output_dir', '--fp16', '--do_train', '--do_eval', '--do_predict', '--max_train_samples', '500', '--max_eval_samples', '50', '--max_predict_samples', '50', '--num_train_epochs', '3', '--dataset_name', 'wmt16', '--dataset_config', 'ro-en', '--source_lang', 'en', '--target_lang', 'ro', '--predict_with_generate', '--forced_bos_token', 'ro'] exits with return code = 1
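Not part of the original report: to help localize the tracing error, a stripped-down script along the following lines runs a few forward/backward steps of `facebook/m2m100_418M` under ZeRO-3 without going through `run_translation.py`. The config dict and hyperparameters below are placeholders chosen for illustration, not the values from `ds_config_zero3.json`.

```python
# min_repro.py -- hypothetical minimal check, not taken from the issue.
# Runs a few forward/backward steps of M2M100 under DeepSpeed ZeRO-3 with the
# bare model, to see whether the parameter-tracing error still reproduces.
# Launch with: deepspeed --num_gpus=1 min_repro.py
import deepspeed
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

ds_config = {
    "train_micro_batch_size_per_gpu": 2,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 3},
    "optimizer": {"type": "AdamW", "params": {"lr": 3e-5}},
}

tokenizer = M2M100Tokenizer.from_pretrained(
    "facebook/m2m100_418M", src_lang="en", tgt_lang="ro"
)
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

src = ["UN Chief Says There Is No Military Solution in Syria", "Hello world"]
tgt = ["Şeful ONU declară că nu există o soluţie militară în Siria", "Salut lume"]
batch = tokenizer(src, return_tensors="pt", padding=True)
with tokenizer.as_target_tokenizer():  # 4.18-era API for encoding the labels
    batch["labels"] = tokenizer(tgt, return_tensors="pt", padding=True).input_ids
batch = {k: v.to(engine.device) for k, v in batch.items()}

for step in range(5):  # the reported failure shows up within the first step
    loss = engine(**batch).loss
    engine.backward(loss)
    engine.step()
    print(f"step {step}: loss = {loss.item():.4f}")
```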
Expected behavior
The model trains.
Additional info
Changing the DeepSpeed version from 0.6.1 to 0.5.10 and the transformers version from 4.18.0 to 4.16.2 results in the following error:
Traceback (most recent call last):
File "run_translation.py", line 636, in <module>
main()
File "run_translation.py", line 553, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1365, in train
tr_loss_step = self.training_step(model, inputs)
File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1956, in training_step
loss = self.deepspeed.backward(loss)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1697, in backward
self.optimizer.backward(loss)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 2944, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 53, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
Variable._execution_engine.run_backward(
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/function.py", line 199, in apply
return user_fn(self, *args)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 562, in backward
ctx.pre_backward_function(ctx.module)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1456, in _run_before_backward_function
self.pre_sub_module_backward_function(sub_module)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1551, in pre_sub_module_backward_function
self.param_coordinator.prefetch_next_sub_modules(sub_module,
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 358, in prefetch_next_sub_modules
params_to_prefetch = self.prefetch_coordinator.get_params_to_prefetch(
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 220, in get_params_to_prefetch
if sub_module.id != self.sub_module_trace[self.step_id]:
IndexError: list index out of range
1%|█ | 1/189 [00:01<04:02, 1.29s/it]
[2022-04-10 20:44:02,482] [INFO] [launch.py:160:sigkill_handler] Killing subprocess 45884
[2022-04-10 20:44:02,482] [ERROR] [launch.py:166:sigkill_handler] ['/opt/conda/bin/python3.8', '-u', 'run_translation.py', '--local_rank=0', '--deepspeed', 'config/ds_config_zero3.json', '--model_name_or_path', 'facebook/m2m100_418M', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '8', '--output_dir', 'output_dir', '--overwrite_output_dir', '--fp16', '--do_train', '--do_eval', '--do_predict', '--max_train_samples', '500', '--max_eval_samples', '50', '--max_predict_samples', '50', '--num_train_epochs', '3', '--dataset_name', 'wmt16', '--dataset_config', 'ro-en', '--source_lang', 'en', '--target_lang', 'ro', '--predict_with_generate', '--forced_bos_token', 'ro'] exits with return code = 1
Yes you are right, it does work! Thanks a lot for fixing this @stas00!
update, nope, it works just fine. I have just suggested to you to use PYTHONPATH and haven't used it myself 😉

Try again with:
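The command that followed "Try again with:" is cut off in this capture. For context only, running the example against a source checkout of `transformers` via `PYTHONPATH` typically looks something like the sketch below; the paths and trailing flags are placeholders, not the ones from the original comment.

```bash
# Hypothetical illustration only -- the actual command in the original comment
# is truncated above. Put the src/ directory of a transformers source checkout
# on PYTHONPATH so the example picks up unreleased fixes.
git clone https://github.com/huggingface/transformers.git
cd transformers/examples/pytorch/translation
PYTHONPATH=../../../src deepspeed run_translation.py \
  --deepspeed ds_config_zero3.json \
  --model_name_or_path facebook/m2m100_418M  # remaining flags as in the repro command above
```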