[DeepSpeed] strange learning rate schedule in linear_schedule_with_warmup
Environment info
- `transformers` version: 4.3.2
- Platform: Linux
- Python version: 3.7.3
- PyTorch version (GPU?): 1.7 (yes)
- Tensorflow version (GPU?): N/A
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: yes (DeepSpeed)
Who can help
Information
Model I am using (Bert, XLNet …): GPT-2
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQUaD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
I am trying to use DeepSpeed with run_clm.py to train GPT-2 (from scratch).
I want to use the same scheduler (`linear_schedule_with_warmup`) and optimizer as the ones used in run_clm.py, so I removed the `scheduler` and `optimizer` sections from `examples/tests/deepspeed/ds_config.json` and let the original ones be used.
My `ds_config.json` is as follows:
```json
{
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true,
    "cpu_offload": true
  },
  "zero_allow_untested_optimizer": true,
  "steps_per_print": 2000,
  "wall_clock_breakdown": false
}
```
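For reference, the schedule I expect is the one built by `get_linear_schedule_with_warmup` in `transformers`. A minimal standalone sketch of the expected curve (the step counts below are illustrative, not the values from the actual run):

```python
# Minimal sketch of the expected linear warmup/decay curve.
# Step counts are illustrative, not taken from the actual run.
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=1000
)

lrs = []
for _ in range(1000):
    optimizer.step()
    scheduler.step()
    lrs.append(scheduler.get_last_lr()[0])

# Expected: LR rises from 0 to 2e-5 over the first 100 steps,
# then decays linearly to 0 at step 1000 (not in the middle of training).
print(lrs[0], lrs[99], lrs[499], lrs[-1])
```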
I ran the following command (using 4 GPUs on one node):
```bash
$ cd examples/language-modeling/
$ deepspeed run_clm.py \
    --output_dir=/somewhere \
    --model_type=gpt2 \
    --do_train \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --tokenizer_name gpt2 \
    --block_size=512 \
    --num_train_epochs=5 \
    --warmup_steps=100 \
    --learning_rate=2e-5 \
    --per_device_train_batch_size=32 \
    --per_device_eval_batch_size=32 \
    --save_steps=10000 \
    --save_total_limit=5 \
    --dataloader_drop_last \
    --deepspeed ds_config.json \
    --logging_steps=10
```
The learning rate schedule was strange. The following is a TensorBoard screenshot.

[TensorBoard screenshot: learning rate curve]

The initial learning rate was 1e-5, but it should have been 0. The learning rate then went up to 2e-5 (which was OK), but it dropped to 0 around the middle of training (well before the end), which was strange.
I tested a `WarmupDecayLR` scheduler in DeepSpeed directly (without `transformers`), and it seemed OK.
So, I think the problem is in how `transformers` uses this scheduler.
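A minimal sketch of this kind of standalone check (the `deepspeed.runtime.lr_schedules.WarmupDecayLR` import path and argument names are my assumption and should be verified against the installed DeepSpeed version; the step counts are illustrative):

```python
# Standalone check of DeepSpeed's WarmupDecayLR, without transformers.
# The import path and constructor arguments below are assumptions; verify
# them against the installed DeepSpeed version.
import torch
from deepspeed.runtime.lr_schedules import WarmupDecayLR

model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scheduler = WarmupDecayLR(
    optimizer,
    total_num_steps=1000,   # illustrative
    warmup_min_lr=0.0,
    warmup_max_lr=2e-5,
    warmup_num_steps=100,
)

lrs = []
for _ in range(1000):
    optimizer.step()
    scheduler.step()
    lrs.append(optimizer.param_groups[0]["lr"])

# Expected: LR ramps up to 2e-5 over the first ~100 steps,
# then decays to ~0 by step 1000.
print(lrs[0], lrs[99], lrs[-1])
```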
Expected behavior
The learning rate schedule when training through DeepSpeed should be the same as the original one used in run_clm.py.
Top GitHub Comments
Thank you for your feedback and for helping to get this problem fixed, @tomohideshibata.

Thanks. I have tested the latest version (without setting `"initial_scale_power": 1`), and the learning rate behavior is as expected! Thanks for your work. It is very useful to be able to use DeepSpeed in `transformers`.