
[DeepSpeed] strange learning rate schedule in linear_schedule_with_warmup

Environment info

  • transformers version: 4.3.2
  • Platform: Linux
  • Python version: 3.7.3
  • PyTorch version (GPU?): 1.7 (yes)
  • Tensorflow version (GPU?): N/A
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes (DeepSpeed)

Who can help

@stas00

Information

Model I am using (Bert, XLNet …): GPT-2

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

I am trying to use DeepSpeed with run_clm.py to train GPT-2 (from scratch). I want to use the same scheduler (linear_schedule_with_warmup) and optimizer as the ones used in run_clm.py, so I removed the scheduler and optimizer sections from examples/tests/deepspeed/ds_config.json and let the original ones be used.
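
For context, the schedule I want is the standard transformers linear warmup/decay that run_clm.py sets up through the Trainer. A minimal sketch of that scheduler on its own is below; the placeholder model, optimizer, and total step count are only for illustration, not values from my run:

import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(8, 8)  # placeholder model, just to have parameters
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# the learning rate should rise linearly from 0 to 2e-5 over the first 100 steps,
# then decay linearly back to 0 at the last training step
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=1000
)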

My ds_config.json is as follows:

{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true,
        "cpu_offload": true
    },

    "zero_allow_untested_optimizer": true,

    "steps_per_print": 2000,
    "wall_clock_breakdown": false
}

I ran the following command (using 4 GPUs on one node):

$ cd examples/language-modeling/
$ deepspeed run_clm.py \
    --output_dir=/somewhere \
    --model_type=gpt2 \
    --do_train \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --tokenizer_name gpt2 \
    --block_size=512 \
    --num_train_epochs=5 \
    --warmup_steps=100 \
    --learning_rate=2e-5 \
    --per_device_train_batch_size=32 \
    --per_device_eval_batch_size=32 \
    --save_steps=10000 \
    --save_total_limit=5 \
    --dataloader_drop_last \
    --deepspeed ds_config.json \
    --logging_steps=10

The learning rate schedule was strange. The following is a TensorBoard screenshot.

[TensorBoard screenshot: learning rate curve]

The initial learning rate was 1e-5, but it should have been 0. The learning rate then rose to 2e-5 (which was fine), but it dropped back to 0 around the middle of training (well before the end), which was strange.

I tested the WarmupDecayLR scheduler in DeepSpeed by itself (without transformers), and it behaved as expected, so I think the problem is in how transformers sets up this scheduler.
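
My standalone check looked roughly like the sketch below (not my exact script; I am assuming the WarmupDecayLR class from deepspeed.runtime.lr_schedules with its total_num_steps / warmup_min_lr / warmup_max_lr / warmup_num_steps arguments, and the step count is illustrative):

import torch
from deepspeed.runtime.lr_schedules import WarmupDecayLR

model = torch.nn.Linear(8, 8)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# warm up to 2e-5 over the first 100 steps, then decay to 0 by step 1000
scheduler = WarmupDecayLR(
    optimizer,
    total_num_steps=1000,
    warmup_min_lr=0.0,
    warmup_max_lr=2e-5,
    warmup_num_steps=100,
)

for step in range(1000):
    # forward / backward / optimizer.step() omitted
    scheduler.step()
    if step % 100 == 0:
        print(step, scheduler.get_lr())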

Expected behavior

The learning rate schedule through deepspeed should be the same as the original one used in run_clm.py.
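
Concretely, with warmup_steps=100 and learning_rate=2e-5, I would expect the usual linear warmup/decay shape. A back-of-the-envelope sketch of that formula is below (the total step count of 1000 is just an illustrative number, not the real length of my run):

def expected_lr(step, warmup_steps=100, total_steps=1000, max_lr=2e-5):
    # linear warmup from 0 to max_lr, then linear decay to 0 at total_steps
    if step < warmup_steps:
        return max_lr * step / max(1, warmup_steps)
    return max_lr * max(0, total_steps - step) / max(1, total_steps - warmup_steps)

print(expected_lr(0))     # 0.0 -- not 1e-5
print(expected_lr(100))   # 2e-05 at the end of warmup
print(expected_lr(550))   # 1e-05 halfway through the decay
print(expected_lr(1000))  # 0.0 only at the very last step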

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 11 (11 by maintainers)

Top GitHub Comments

1 reaction
stas00 commented on Feb 24, 2021

Thank you for your feedback and for supporting the process of fixing this problem, @tomohideshibata

1 reaction
tomohideshibata commented on Feb 24, 2021

Thanks.

I have tested the latest version (without setting "initial_scale_power": 1), and the learning rate behavior is as expected!

[TensorBoard screenshot: learning rate curve now as expected]

Thanks for your work. Being able to use DeepSpeed with transformers is very useful.
