Incorrect `num_warmup_steps` for `lr_scheduler` for multi-gpu training
System Info
- `Accelerate` version: 0.10.0
- Platform: Linux-3.10.0_3-0-0-12-x86_64-with-centos-6.3-Final
- Python version: 3.7.12
- Numpy version: 1.21.6
- PyTorch version (GPU?): 1.7.1 (True)
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: no
- use_cpu: False
- num_processes: 8
- machine_rank: 0
- num_machines: 1
- main_process_ip: None
- main_process_port: None
- main_training_function: main
- deepspeed_config: {}
- fsdp_config: {}
Information
- The official example scripts
- My own modified scripts
Tasks
- One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
- My own task or dataset (give details below)
Reproduction
```python
from transformers import get_scheduler

# define lr scheduler
lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    num_warmup_steps=args.warmup_steps,
    num_training_steps=args.max_train_steps,
)
...
if step % args.gradient_accumulation_steps == 0:
    optimizer.step()
    lr_scheduler.step()  # step the lr scheduler once every `gradient_accumulation_steps` steps
    optimizer.zero_grad()
```
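One commonly used mitigation (a hedged sketch reusing `optimizer` and `args` from the snippet above; it assumes the scheduler is later passed through `accelerator.prepare`, which steps it once per process on each `lr_scheduler.step()` call): scale the scheduler's step counts by `accelerator.num_processes` so the warmup still peaks after the intended number of optimizer updates.

```python
# Sketch of a possible workaround, not necessarily the resolution of this issue.
from accelerate import Accelerator
from transformers import get_scheduler

accelerator = Accelerator()

# Multiply both step counts by the process count so that a scheduler which is
# advanced num_processes times per optimizer update still follows the intended
# per-update schedule.
lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    num_warmup_steps=args.warmup_steps * accelerator.num_processes,
    num_training_steps=args.max_train_steps * accelerator.num_processes,
)
```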
Expected behavior
Does accelerate take the number of processes into account for num_warmup_steps?
Suppose we set args.warmup_steps=80 and train on a single 8-GPU machine: the linear learning-rate warmup peaks after 10 optimizer updates (i.e., 80/8) rather than after the expected 80.
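To make the arithmetic concrete, here is a small single-process illustration (hypothetical: it mimics the per-process stepping by hand rather than using Accelerate itself) showing that a scheduler advanced 8 internal steps per optimizer update exhausts an 80-step warmup after only 10 updates:

```python
import torch
from transformers import get_scheduler

num_processes = 8
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=1e-3)
scheduler = get_scheduler(
    "linear", optimizer, num_warmup_steps=80, num_training_steps=1000
)

for _ in range(10):  # 10 optimizer updates
    optimizer.step()
    for _ in range(num_processes):  # mimic one scheduler step per process
        scheduler.step()

# The learning rate has already reached its 1e-3 peak after 10 updates,
# instead of the 80 updates the warmup setting suggests.
print(optimizer.param_groups[0]["lr"])
```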
Issue Analytics
- Created: a year ago
- Comments: 19 (1 by maintainers)
Top GitHub Comments
Hello @cyk1337, the link you have provided achieves args.max_train_steps // num_gpus because the prepared scheduler is stepped num_processes times per iteration, i.e., num_gpus times per iteration. I didn't understand what the query was in the case of not preparing lr_scheduler. As per the original question, it is logical for the warmup steps to be reduced in a multi-device scenario.

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed, please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
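Building on the comment above about not preparing lr_scheduler, here is a hedged sketch (reusing model, optimizer, train_dataloader, and args from the reproduction snippet; `model(**batch).loss` assumes a Hugging Face-style model) in which the scheduler is kept out of `accelerator.prepare(...)`, so it advances exactly once per optimizer update regardless of the number of processes:

```python
from accelerate import Accelerator
from transformers import get_scheduler

accelerator = Accelerator()
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

# Built after prepare() and never passed to it, so Accelerate does not wrap it
# and does not step it once per process.
lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    num_warmup_steps=args.warmup_steps,
    num_training_steps=args.max_train_steps,
)

for step, batch in enumerate(train_dataloader, start=1):
    loss = model(**batch).loss
    accelerator.backward(loss)
    if step % args.gradient_accumulation_steps == 0:
        optimizer.step()
        lr_scheduler.step()  # one scheduler step per optimizer update on every process
        optimizer.zero_grad()
```

Whether the warmup should shrink with the process count is exactly the judgment call discussed in this thread; the sketch only shows how to opt out of the per-process stepping.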