
Finetuning m2m100 with run_translation_no_trainer.py using ZeRO stage 3 hangs during evaluation after the first epoch


System Info

  • transformers version: 4.22.0.dev0
  • Platform: Linux-5.15.0-41-generic-x86_64-with-glibc2.17
  • Python version: 3.8.13
  • Huggingface_hub version: 0.8.1
  • PyTorch version (GPU?): 1.12.0+cu113 (True)
  • Tensorflow version (GPU?): 2.10.0 (True)
  • Flax version (CPU?/GPU?/TPU?): 0.4.1 (gpu)
  • Jax version: 0.3.5
  • JaxLib version: 0.3.5
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

  1. Run accelerate config. The resulting Accelerate config is as follows:
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 4
use_cpu: false
  2. Run the fine-tuning script with the command: accelerate launch run_translation_no_trainer.py --model_name_or_path facebook/m2m100_418M --source_lang ro --target_lang zh --train_file teddata/train.json --validation_file teddata/val.json --output_dir ./m2m100_418M --max_source_length 128 --max_target_length 128 --per_device_train_batch_size=8 --per_device_eval_batch_size=4 --forced_bos_token zh
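
For context, the run_translation example scripts expect the --train_file / --validation_file JSON files to be in JSON-lines format, with each line holding a single "translation" object keyed by language code. Below is a minimal sketch of producing files in that shape; the teddata paths come from the command above, while the sentences are made-up placeholders rather than the actual dataset:

import json
import os

# Each line of train.json / val.json is one JSON object with a "translation"
# field mapping language codes to sentences -- the layout the HF translation
# examples read. The sentences here are placeholders for illustration only.
examples = [
    {"translation": {"ro": "Buna ziua, lume!", "zh": "你好，世界！"}},
    {"translation": {"ro": "Multumesc foarte mult.", "zh": "非常感谢。"}},
]

os.makedirs("teddata", exist_ok=True)
with open("teddata/train.json", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")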

Training output:

11/09/2022 11:02:34 - INFO - __main__ - ***** Running training *****
11/09/2022 11:02:34 - INFO - __main__ -   Num examples = 1000
11/09/2022 11:02:34 - INFO - __main__ -   Num Epochs = 3
11/09/2022 11:02:34 - INFO - __main__ -   Instantaneous batch size per device = 8
11/09/2022 11:02:34 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 32
11/09/2022 11:02:34 - INFO - __main__ -   Gradient Accumulation steps = 1
11/09/2022 11:02:34 - INFO - __main__ -   Total optimization steps = 94
 33%|███████████████████████████                                          | 32/94 [18:31<39:25,  9.20s/it]

Fine-tuning hangs here, with GPU utilization at almost 100% on all GPUs. When the accelerate config is set to ZeRO stage 2 instead, fine-tuning succeeds.

Expected behavior

Fine-tuning m2m100 with run_translation_no_trainer.py using ZeRO stage 3 finishes successfully.

Issue Analytics

  • State: closed
  • Created 10 months ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
cokuehuang commented, Nov 11, 2022

@pacman100 Yes! It works! Thanks very much!

0 reactions
pacman100 commented, Nov 10, 2022

Hello @cokuehuang, thank you for providing the minimal script and data for reproducing the issue on our end. When using ZeRO Stage 3, the following needs to be passed to the generate function call:

if accelerator.state.deepspeed_plugin.zero_stage == 3:
    gen_kwargs["synced_gpus"] = True  # required for ZeRO Stage 3

After adding it, everything should work just fine when using DeepSpeed ZeRO-3 with or without CPU offloading:

11/10/2022 14:09:03 - INFO - __main__ - ***** Running training *****
11/10/2022 14:09:03 - INFO - __main__ -   Num examples = 1000
11/10/2022 14:09:03 - INFO - __main__ -   Num Epochs = 3
11/10/2022 14:09:03 - INFO - __main__ -   Instantaneous batch size per device = 16
11/10/2022 14:09:03 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 32
11/10/2022 14:09:03 - INFO - __main__ -   Gradient Accumulation steps = 1
11/10/2022 14:09:03 - INFO - __main__ -   Total optimization steps = 96
 33%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                          | 32/96 [01:14<02:28,  2.32s/it]{'max_length': 128, 'num_beams': None, 'synced_gpus': True}
 33%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                          | 32/96 [01:14<02:28,  2.32s/it]{'max_length': 128, 'num_beams': None, 'synced_gpus': True}
11/10/2022 14:13:04 - INFO - __main__ - {'bleu': 6.697252711851462}
 67%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                     | 64/96 [05:13<01:14,  2.32s/it]{'max_length': 128, 'num_beams': None, 'synced_gpus': True}
 67%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                     | 64/96 [05:13<01:14,  2.32s/it]{'max_length': 128, 'num_beams': None, 'synced_gpus': True}
11/10/2022 14:16:52 - INFO - __main__ - {'bleu': 6.944214970589274}
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 96/96 [09:02<00:00,  2.33s/it]{'max_length': 128, 'num_beams': None, 'synced_gpus': True}
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 96/96 [09:02<00:00,  2.33s/it]{'max_length': 128, 'num_beams': None, 'synced_gpus': True}
11/10/2022 14:20:52 - INFO - __main__ - {'bleu': 6.8998500689065}
Configuration saved in ./m2m100_418M/config.json
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 96/96 [11:48<00:00,  7.38s/it]
Model weights saved in ./m2m100_418M/pytorch_model.bin
tokenizer config file saved in ./m2m100_418M/tokenizer_config.json
Special tokens file saved in ./m2m100_418M/special_tokens_map.json
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 96/96 [11:48<00:00,  7.38s/it]
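
For anyone applying this fix to their own copy of run_translation_no_trainer.py: the check belongs in the evaluation loop, just before gen_kwargs is handed to model.generate. Below is a self-contained sketch of one way to package it; the helper name build_gen_kwargs and its defaults are illustrative assumptions rather than code from the thread, and the getattr guard simply makes the helper a no-op when DeepSpeed is not configured:

from accelerate import Accelerator

def build_gen_kwargs(accelerator: Accelerator, max_length=128, num_beams=None):
    # Generation kwargs for the evaluation loop. Under ZeRO Stage 3 the model
    # parameters are partitioned across ranks, so every rank has to join each
    # forward pass; synced_gpus=True keeps ranks that finish generating early
    # inside generate() so the remaining ranks do not hang on a collective.
    gen_kwargs = {"max_length": max_length, "num_beams": num_beams}
    plugin = getattr(accelerator.state, "deepspeed_plugin", None)  # None if DeepSpeed unused
    if plugin is not None and plugin.zero_stage == 3:
        gen_kwargs["synced_gpus"] = True  # required for ZeRO Stage 3
    return gen_kwargs

Inside the loop this would be used roughly as accelerator.unwrap_model(model).generate(batch["input_ids"], attention_mask=batch["attention_mask"], **build_gen_kwargs(accelerator)), which mirrors the {'max_length': 128, 'num_beams': None, 'synced_gpus': True} kwargs visible in the log above.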