Incorrect `num_warmup_steps` for `lr_scheduler` for multi-gpu training
System Info
- `Accelerate` version: 0.10.0
- Platform: Linux-3.10.0_3-0-0-12-x86_64-with-centos-6.3-Final
- Python version: 3.7.12
- Numpy version: 1.21.6
- PyTorch version (GPU?): 1.7.1 (True)
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: no
- use_cpu: False
- num_processes: 8
- machine_rank: 0
- num_machines: 1
- main_process_ip: None
- main_process_port: None
- main_training_function: main
- deepspeed_config: {}
- fsdp_config: {}
Information
- The official example scripts
- My own modified scripts
Tasks
- One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
- My own task or dataset (give details below)
Reproduction
```python
from transformers import get_scheduler

# define lr scheduler
lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    num_warmup_steps=args.warmup_steps,
    num_training_steps=args.max_train_steps,
)
...
if step % args.gradient_accumulation_steps == 0:
    optimizer.step()
    lr_scheduler.step()  # step the lr scheduler once every `gradient_accumulation_steps` steps
    optimizer.zero_grad()
```
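One commonly used mitigation (a hedged sketch reusing `optimizer` and `args` from the snippet above; it assumes the scheduler is later passed through `accelerator.prepare`, which steps it once per process on each `lr_scheduler.step()` call): scale the scheduler's step counts by `accelerator.num_processes` so the warmup still peaks after the intended number of optimizer updates.

```python
# Sketch of a possible workaround, not necessarily the resolution of this issue.
from accelerate import Accelerator
from transformers import get_scheduler

accelerator = Accelerator()

# Multiply both step counts by the process count so that a scheduler which is
# advanced num_processes times per optimizer update still follows the intended
# per-update schedule.
lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    num_warmup_steps=args.warmup_steps * accelerator.num_processes,
    num_training_steps=args.max_train_steps * accelerator.num_processes,
)
```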
Expected behavior
Does accelerate take the number of processes into account for num_warmup_steps?
Suppose we set args.warmup_steps=80 and train on a single 8-GPU machine: the linear learning-rate warmup peaks after 10 optimizer updates (i.e., 80/8) rather than after the expected 80.
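To make the arithmetic concrete, here is a small single-process illustration (hypothetical: it mimics the per-process stepping by hand rather than using Accelerate itself) showing that a scheduler advanced 8 internal steps per optimizer update exhausts an 80-step warmup after only 10 updates:

```python
import torch
from transformers import get_scheduler

num_processes = 8
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=1e-3)
scheduler = get_scheduler(
    "linear", optimizer, num_warmup_steps=80, num_training_steps=1000
)

for _ in range(10):  # 10 optimizer updates
    optimizer.step()
    for _ in range(num_processes):  # mimic one scheduler step per process
        scheduler.step()

# The learning rate has already reached its 1e-3 peak after 10 updates,
# instead of the 80 updates the warmup setting suggests.
print(optimizer.param_groups[0]["lr"])
```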
Issue Analytics
- Created: a year ago
- Comments: 19 (1 by maintainers)
Top GitHub Comments
Hello @cyk1337, the link you have provided achieves args.max_train_steps // num_gpus because the prepared scheduler is stepped num_processes times per iteration, i.e., num_gpus times per iteration. I didn't understand what the query was in the case of not preparing lr_scheduler. As per the original question, it is logical for the warmup steps to be reduced in a multi-device scenario.

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed, please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
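Building on the comment above about not preparing lr_scheduler, here is a hedged sketch (reusing model, optimizer, train_dataloader, and args from the reproduction snippet; `model(**batch).loss` assumes a Hugging Face-style model) in which the scheduler is kept out of `accelerator.prepare(...)`, so it advances exactly once per optimizer update regardless of the number of processes:

```python
from accelerate import Accelerator
from transformers import get_scheduler

accelerator = Accelerator()
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

# Built after prepare() and never passed to it, so Accelerate does not wrap it
# and does not step it once per process.
lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    num_warmup_steps=args.warmup_steps,
    num_training_steps=args.max_train_steps,
)

for step, batch in enumerate(train_dataloader, start=1):
    loss = model(**batch).loss
    accelerator.backward(loss)
    if step % args.gradient_accumulation_steps == 0:
        optimizer.step()
        lr_scheduler.step()  # one scheduler step per optimizer update on every process
        optimizer.zero_grad()
```

Whether the warmup should shrink with the process count is exactly the judgment call discussed in this thread; the sketch only shows how to opt out of the per-process stepping.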