
[DeepSpeed] strange learning rate schedule in linear_schedule_with_warmup

Environment info

  • transformers version: 4.3.2
  • Platform: Linux
  • Python version: 3.7.3
  • PyTorch version (GPU?): 1.7 (yes)
  • Tensorflow version (GPU?): N/A
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes (DeepSpeed)

Who can help

@stas00

Information

Model I am using (Bert, XLNet …): GPT-2

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

I am trying to use DeepSpeed with run_clm.py to train GPT-2 (from scratch). I want to use the same scheduler (linear_schedule_with_warmup) and optimizer as the ones used in run_clm.py, so I removed the scheduler and optimizer sections from examples/tests/deepspeed/ds_config.json and let the original ones be used.
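
For context, the schedule I want is the standard transformers linear warmup/decay that run_clm.py sets up through the Trainer. A minimal sketch of that scheduler on its own is below; the placeholder model, optimizer, and total step count are only for illustration, not values from my run:

import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(8, 8)  # placeholder model, just to have parameters
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# the learning rate should rise linearly from 0 to 2e-5 over the first 100 steps,
# then decay linearly back to 0 at the last training step
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=1000
)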

My ds_config.json is as follows:

{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true,
        "cpu_offload": true
    },

    "zero_allow_untested_optimizer": true,

    "steps_per_print": 2000,
    "wall_clock_breakdown": false
}

I ran the following command (using 4 GPUs on one node):

$ cd examples/language-modeling/
$ deepspeed run_clm.py \
    --output_dir=/somewhere \
    --model_type=gpt2 \
    --do_train \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --tokenizer_name gpt2 \
    --block_size=512 \
    --num_train_epochs=5 \
    --warmup_steps=100 \
    --learning_rate=2e-5 \
    --per_device_train_batch_size=32 \
    --per_device_eval_batch_size=32 \
    --save_steps=10000 \
    --save_total_limit=5 \
    --dataloader_drop_last \
    --deepspeed ds_config.json \
    --logging_steps=10

The learning rate schedule was strange. The following is a TensorBoard screenshot.

[TensorBoard screenshot: learning rate curve]

The initial learning rate was 1e-5, but it should have been 0. The learning rate then rose to 2e-5 (which was fine), but it dropped back to 0 around the middle of training (well before the end), which was strange.

I tested the WarmupDecayLR scheduler in DeepSpeed by itself (without transformers), and it behaved as expected, so I think the problem is in how transformers sets up this scheduler.
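
My standalone check looked roughly like the sketch below (not my exact script; I am assuming the WarmupDecayLR class from deepspeed.runtime.lr_schedules with its total_num_steps / warmup_min_lr / warmup_max_lr / warmup_num_steps arguments, and the step count is illustrative):

import torch
from deepspeed.runtime.lr_schedules import WarmupDecayLR

model = torch.nn.Linear(8, 8)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# warm up to 2e-5 over the first 100 steps, then decay to 0 by step 1000
scheduler = WarmupDecayLR(
    optimizer,
    total_num_steps=1000,
    warmup_min_lr=0.0,
    warmup_max_lr=2e-5,
    warmup_num_steps=100,
)

for step in range(1000):
    # forward / backward / optimizer.step() omitted
    scheduler.step()
    if step % 100 == 0:
        print(step, scheduler.get_lr())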

Expected behavior

The learning rate schedule through deepspeed should be the same as the original one used in run_clm.py.
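
Concretely, with warmup_steps=100 and learning_rate=2e-5, I would expect the usual linear warmup/decay shape. A back-of-the-envelope sketch of that formula is below (the total step count of 1000 is just an illustrative number, not the real length of my run):

def expected_lr(step, warmup_steps=100, total_steps=1000, max_lr=2e-5):
    # linear warmup from 0 to max_lr, then linear decay to 0 at total_steps
    if step < warmup_steps:
        return max_lr * step / max(1, warmup_steps)
    return max_lr * max(0, total_steps - step) / max(1, total_steps - warmup_steps)

print(expected_lr(0))     # 0.0 -- not 1e-5
print(expected_lr(100))   # 2e-05 at the end of warmup
print(expected_lr(550))   # 1e-05 halfway through the decay
print(expected_lr(1000))  # 0.0 only at the very last step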

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 11 (11 by maintainers)

Top GitHub Comments

1 reaction
stas00 commented on Feb 24, 2021

Thank you for your feedback and for supporting the process of fixing this problem, @tomohideshibata

1 reaction
tomohideshibata commented on Feb 24, 2021

Thanks.

I have tested the latest version (without setting "initial_scale_power": 1), and the learning rate behavior is as expected!

[TensorBoard screenshot: learning rate curve now as expected]

Thanks for your work. Being able to use DeepSpeed with transformers is very useful.
