
Finetuning m2m100 with run_translation_no_trainer.py using ZeRO stage 3 hangs during evaluation after the first epoch


System Info

  • transformers version: 4.22.0.dev0
  • Platform: Linux-5.15.0-41-generic-x86_64-with-glibc2.17
  • Python version: 3.8.13
  • Huggingface_hub version: 0.8.1
  • PyTorch version (GPU?): 1.12.0+cu113 (True)
  • Tensorflow version (GPU?): 2.10.0 (True)
  • Flax version (CPU?/GPU?/TPU?): 0.4.1 (gpu)
  • Jax version: 0.3.5
  • JaxLib version: 0.3.5
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

  1. Run accelerate config. The resulting Accelerate config is as follows:
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 4
use_cpu: false
  2. Run the fine-tuning script with the command: accelerate launch run_translation_no_trainer.py --model_name_or_path facebook/m2m100_418M --source_lang ro --target_lang zh --train_file teddata/train.json --validation_file teddata/val.json --output_dir ./m2m100_418M --max_source_length 128 --max_target_length 128 --per_device_train_batch_size=8 --per_device_eval_batch_size=4 --forced_bos_token zh
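
For context, the run_translation example scripts expect the --train_file / --validation_file JSON files to be in JSON-lines format, with each line holding a single "translation" object keyed by language code. Below is a minimal sketch of producing files in that shape; the teddata paths come from the command above, while the sentences are made-up placeholders rather than the actual dataset:

import json
import os

# Each line of train.json / val.json is one JSON object with a "translation"
# field mapping language codes to sentences -- the layout the HF translation
# examples read. The sentences here are placeholders for illustration only.
examples = [
    {"translation": {"ro": "Buna ziua, lume!", "zh": "你好，世界！"}},
    {"translation": {"ro": "Multumesc foarte mult.", "zh": "非常感谢。"}},
]

os.makedirs("teddata", exist_ok=True)
with open("teddata/train.json", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")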

Training output:

11/09/2022 11:02:34 - INFO - __main__ - ***** Running training *****
11/09/2022 11:02:34 - INFO - __main__ -   Num examples = 1000
11/09/2022 11:02:34 - INFO - __main__ -   Num Epochs = 3
11/09/2022 11:02:34 - INFO - __main__ -   Instantaneous batch size per device = 8
11/09/2022 11:02:34 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 32
11/09/2022 11:02:34 - INFO - __main__ -   Gradient Accumulation steps = 1
11/09/2022 11:02:34 - INFO - __main__ -   Total optimization steps = 94
 33%|███████████████████████████                                          | 32/94 [18:31<39:25,  9.20s/it]

Fine-tuning hangs here, with GPU utilization at almost 100% on all GPUs. When the accelerate config is set to ZeRO stage 2 instead, fine-tuning succeeds.

Expected behavior

Fine-tuning m2m100 with run_translation_no_trainer.py using ZeRO stage 3 finishes successfully.

Issue Analytics

  • State: closed
  • Created 10 months ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
cokuehuang commented, Nov 11, 2022

@pacman100 Yes! It works! Thanks very much!

0 reactions
pacman100 commented, Nov 10, 2022

Hello @cokuehuang, thank you for providing the minimal script and data for reproducing the issue on our end. When using ZeRO Stage 3, the following needs to be passed to the generate function call:

if accelerator.state.deepspeed_plugin.zero_stage == 3:
    gen_kwargs["synced_gpus"] = True  # required for ZeRO Stage 3

After adding it, everything should work just fine when using DeepSpeed ZeRO-3 with or without CPU offloading:

11/10/2022 14:09:03 - INFO - __main__ - ***** Running training *****
11/10/2022 14:09:03 - INFO - __main__ -   Num examples = 1000
11/10/2022 14:09:03 - INFO - __main__ -   Num Epochs = 3
11/10/2022 14:09:03 - INFO - __main__ -   Instantaneous batch size per device = 16
11/10/2022 14:09:03 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 32
11/10/2022 14:09:03 - INFO - __main__ -   Gradient Accumulation steps = 1
11/10/2022 14:09:03 - INFO - __main__ -   Total optimization steps = 96
 33%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                          | 32/96 [01:14<02:28,  2.32s/it]{'max_length': 128, 'num_beams': None, 'synced_gpus': True}
 33%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                          | 32/96 [01:14<02:28,  2.32s/it]{'max_length': 128, 'num_beams': None, 'synced_gpus': True}
11/10/2022 14:13:04 - INFO - __main__ - {'bleu': 6.697252711851462}
 67%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                     | 64/96 [05:13<01:14,  2.32s/it]{'max_length': 128, 'num_beams': None, 'synced_gpus': True}
 67%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                     | 64/96 [05:13<01:14,  2.32s/it]{'max_length': 128, 'num_beams': None, 'synced_gpus': True}
11/10/2022 14:16:52 - INFO - __main__ - {'bleu': 6.944214970589274}
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 96/96 [09:02<00:00,  2.33s/it]{'max_length': 128, 'num_beams': None, 'synced_gpus': True}
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 96/96 [09:02<00:00,  2.33s/it]{'max_length': 128, 'num_beams': None, 'synced_gpus': True}
11/10/2022 14:20:52 - INFO - __main__ - {'bleu': 6.8998500689065}
Configuration saved in ./m2m100_418M/config.json
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 96/96 [11:48<00:00,  7.38s/it]
Model weights saved in ./m2m100_418M/pytorch_model.bin
tokenizer config file saved in ./m2m100_418M/tokenizer_config.json
Special tokens file saved in ./m2m100_418M/special_tokens_map.json
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 96/96 [11:48<00:00,  7.38s/it]
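
For anyone applying this fix to their own copy of run_translation_no_trainer.py: the check belongs in the evaluation loop, just before gen_kwargs is handed to model.generate. Below is a self-contained sketch of one way to package it; the helper name build_gen_kwargs and its defaults are illustrative assumptions rather than code from the thread, and the getattr guard simply makes the helper a no-op when DeepSpeed is not configured:

from accelerate import Accelerator

def build_gen_kwargs(accelerator: Accelerator, max_length=128, num_beams=None):
    # Generation kwargs for the evaluation loop. Under ZeRO Stage 3 the model
    # parameters are partitioned across ranks, so every rank has to join each
    # forward pass; synced_gpus=True keeps ranks that finish generating early
    # inside generate() so the remaining ranks do not hang on a collective.
    gen_kwargs = {"max_length": max_length, "num_beams": num_beams}
    plugin = getattr(accelerator.state, "deepspeed_plugin", None)  # None if DeepSpeed unused
    if plugin is not None and plugin.zero_stage == 3:
        gen_kwargs["synced_gpus"] = True  # required for ZeRO Stage 3
    return gen_kwargs

Inside the loop this would be used roughly as accelerator.unwrap_model(model).generate(batch["input_ids"], attention_mask=batch["attention_mask"], **build_gen_kwargs(accelerator)), which mirrors the {'max_length': 128, 'num_beams': None, 'synced_gpus': True} kwargs visible in the log above.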