Finetuning m2m100 with run_translation_no_trainer.py using ZeRO stage 3 hangs at evaluation after the first epoch
System Info
- transformers version: 4.22.0.dev0
- Platform: Linux-5.15.0-41-generic-x86_64-with-glibc2.17
- Python version: 3.8.13
- Huggingface_hub version: 0.8.1
- PyTorch version (GPU?): 1.12.0+cu113 (True)
- Tensorflow version (GPU?): 2.10.0 (True)
- Flax version (CPU?/GPU?/TPU?): 0.4.1 (gpu)
- Jax version: 0.3.5
- JaxLib version: 0.3.5
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: yes
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
- Accelerate config (generated by accelerate config) is as follows:
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 4
use_cpu: false
- Run the finetuning script with the following command:
accelerate launch run_translation_no_trainer.py --model_name_or_path facebook/m2m100_418M --source_lang ro --target_lang zh --train_file teddata/train.json --validation_file teddata/val.json --output_dir ./m2m100_418M --max_source_length 128 --max_target_length 128 --per_device_train_batch_size=8 --per_device_eval_batch_size=4 --forced_bos_token zh
Training output:
11/09/2022 11:02:34 - INFO - main - ***** Running training *****
11/09/2022 11:02:34 - INFO - main - Num examples = 1000
11/09/2022 11:02:34 - INFO - main - Num Epochs = 3
11/09/2022 11:02:34 - INFO - main - Instantaneous batch size per device = 8
11/09/2022 11:02:34 - INFO - main - Total train batch size (w. parallel, distributed & accumulation) = 32
11/09/2022 11:02:34 - INFO - main - Gradient Accumulation steps = 1
11/09/2022 11:02:34 - INFO - main - Total optimization steps = 94
33%|███████████████████████████                | 32/94 [18:31<39:25, 9.20s/it]
Finetuning hangs here with GPU utilization near 100% on all GPUs, presumably at the evaluation loop's generate call (sketched below). With the accelerate config set to ZeRO stage 2 instead, finetuning completes successfully.
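For context, below is a paraphrased sketch of the evaluation loop in run_translation_no_trainer.py (names and structure approximate, not a verbatim excerpt). Under ZeRO stage 3 the model weights are sharded across ranks, so every forward pass requires all ranks to join an all-gather of the parameters; if one rank finishes decoding earlier than the others and stops calling forward, the remaining ranks block at that collective, which matches the hang described above.

    # Paraphrased sketch of the evaluation loop (approximate, not a verbatim excerpt).
    model.eval()
    gen_kwargs = {
        "max_length": args.max_target_length,
        "num_beams": args.num_beams,
    }
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            # Under ZeRO-3, every decoding step triggers an all-gather of the
            # sharded weights; a rank that finishes generating early stops joining
            # those collectives and the other ranks wait indefinitely.
            generated_tokens = accelerator.unwrap_model(model).generate(
                batch["input_ids"],
                attention_mask=batch["attention_mask"],
                **gen_kwargs,
            )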
Expected behavior
Finetuning m2m100 with run_translation_no_trainer.py using ZeRO stage 3 finishes successfully.
Top GitHub Comments
@pacman100 Yes! It works! Thanks very much!
Hello @cokuehuang, thank you for giving the minimal script and data for reproducing the issue on our end. When using ZeRO stage-3, the following needs to be passed to the generate function call; after adding it, everything should work just fine when using DS ZeRO-3 with/without CPU offloading.
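The snippet itself did not survive in the quoted comment above. Based on the Transformers DeepSpeed integration documentation, the argument generate needs under ZeRO stage 3 is synced_gpus=True; the sketch below shows where it would go in the evaluation loop. Treat synced_gpus=True as an assumption reconstructed from the docs, not a verbatim quote of the comment.

    # Assumed fix (reconstructed, not quoted from the comment): synced_gpus=True
    # keeps every rank stepping through generate until all ranks are done, so the
    # ZeRO-3 parameter all-gathers stay in sync and the evaluation no longer hangs.
    generated_tokens = accelerator.unwrap_model(model).generate(
        batch["input_ids"],
        attention_mask=batch["attention_mask"],
        synced_gpus=True,
        **gen_kwargs,
    )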