[BUG] gradient overflow with fp16 enabled
Describe the bug
I was trying to use DeepSpeed fp16 mode to train a 4.7B GPT-2 model. At first it worked fine, but after about 220 steps the logs started to indicate an overflow and kept reducing the loss scale until it reached 1 and stayed there, and the loss value became NaN.
I tried the method mentioned in this answer: https://github.com/huggingface/transformers/issues/15570#issuecomment-1035306492, but after setting a higher initial scale power (I tried 18, 24, and 32), the overflow was present even at the first step:
steps: 1 loss: 11.4926 iter time (s): 36.217 samples/sec: 70.686
[2022-02-16 01:22:29,596] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | pipe_send_output: 4606.68 | pipe_recv_grad: 753.47
[2022-02-16 01:23:00,852] [INFO] [stage_1_and_2.py:1644:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648.0
[2022-02-16 01:23:00,853] [INFO] [logging.py:69:log_dist] [Rank 0] step=2, skipped=2, lr=[0.0001, 0.0001], mom=[(0.85, 0.99), (0.85, 0.99)]
[2022-02-16 01:23:00,853] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | batch_input: 325.46 | forward_microstep: 8068.21 | backward_microstep: 20297.51 | backward_inner_microstep: 20295.81 | backward_allreduce_microstep: 0.00 | step_microstep: 2.04
[2022-02-16 01:23:00,853] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 8069.56 | backward: 20296.67 | backward_inner: 20294.76 | backward_allreduce: 0.00 | step: 2.03
steps: 2 loss: 11.4975 iter time (s): 30.870 samples/sec: 82.929
[2022-02-16 01:23:00,853] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | pipe_send_output: 1377.88 | pipe_recv_grad: 663.08
[2022-02-16 01:23:31,730] [INFO] [stage_1_and_2.py:1644:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648.0, reducing to 1073741824.0
[2022-02-16 01:23:31,731] [INFO] [logging.py:69:log_dist] [Rank 0] step=3, skipped=3, lr=[0.0001, 0.0001], mom=[(0.85, 0.99), (0.85, 0.99)]
[2022-02-16 01:23:31,731] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | batch_input: 294.01 | forward_microstep: 8067.96 | backward_microstep: 20333.02 | backward_inner_microstep: 20331.33 | backward_allreduce_microstep: 0.00 | step_microstep: 2.23
[2022-02-16 01:23:31,731] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 8068.90 | backward: 20332.18 | backward_inner: 20330.27 | backward_allreduce: 0.00 | step: 2.21
I also tried this: https://github.com/microsoft/DeepSpeed/issues/697#issuecomment-767874226, but that method instead leads to a different error:
AssertionError: data parallel group is not initialized
So now I don’t have any idea how to solve this problem.
To Reproduce
Here's my training script:
#! /bin/bash
GPUS_PER_NODE=6
DATETIME=`date +'date_%y-%m-%d_time_%H-%M-%S'`
LOAD_CHECKPOINT_PATH=/some_path
SAVE_CHECKPOINT_PATH=/some_path
TENSORBOARD_PATH=/workspace/tensorboard/$DATETIME
VOCAB_FILE=vocab.txt
DATA_PATH=$(cat dataset.txt)
CONFIG_JSON=deepspeed_config.json
TENSOR_PARALLEL=2
PIPELINE_PARALLEL=3
HIDDEN=3072
ATTENTION_HEADS=24
LAYERS=40
SEQ=2048
GLOBAL_BATCH=2560
MICRO_BATCH=64
TOKENS=1000000000
ZERO_STAGE=1
cat <<EOT > $CONFIG_JSON
{
"train_batch_size" : $GLOBAL_BATCH,
"train_micro_batch_size_per_gpu": $MICRO_BATCH,
"steps_per_print": 1,
"wall_clock_breakdown": true,
"gradient_clipping": 1.0,
"prescale_gradients": false,
"optimizer": {
"type": "OneBitLamb",
"params": {
"lr": 11e-3,
"max_coeff": 0.3,
"min_coeff": 0.01,
"freeze_step": 1000,
"cuda_aware": false,
"comm_backend_name": "nccl",
"coeff_beta": 0.9,
"factor_max": 4.0,
"factor_min": 0.5,
"factor_threshold": 0.1
}
},
"scheduler": {
"type": "OneCycle",
"params": {
"cycle_first_step_size": 1000,
"cycle_first_stair_count": 500,
"cycle_second_step_size": 1000,
"cycle_second_stair_count": 500,
"decay_step_size": 1000,
"cycle_min_lr": 0.0001,
"cycle_max_lr": 0.0010,
"decay_lr_rate": 0.001,
"cycle_min_mom": 0.85,
"cycle_max_mom": 0.99,
"decay_mom_rate": 0.0
}
},
"aio": {
"thread_count": 8,
"single_submit": true
},
"zero_optimization": {
"stage": $ZERO_STAGE,
"contiguous_gradients": true,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 5e8,
"allgather_bucket_size": 5e8
},
"activation_checkpointing": {
"partition_activations": true,
"cpu_checkpointing": true,
"profile": true
},
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 500,
"hysteresis": 2,
"min_loss_scale": 1,
"initial_scale_power": 12
},
"curriculum_learning": {
"enabled": true,
"curriculum_type": "seqlen",
"schedule_type": "fixed_linear",
"min_difficulty": 64,
"max_difficulty": 1024,
"schedule_config": {
"total_curriculum_step": 15000,
"difficulty_step": 8
}
}
}
EOT
OPTIONS="--tokenizer-type EncDecTokenizer \
--vocab-file $VOCAB_FILE \
--tensor-model-parallel-size $TENSOR_PARALLEL \
--pipeline-model-parallel-size $PIPELINE_PARALLEL \
--num-layers $LAYERS \
--hidden-size $HIDDEN \
--num-attention-heads $ATTENTION_HEADS \
--seq-length $SEQ \
--max-position-embeddings $SEQ \
--micro-batch-size $MICRO_BATCH \
--global-batch-size $GLOBAL_BATCH \
--train-samples 1000000000 \
--train-tokens $TOKENS \
--data-path $DATA_PATH \
--save $SAVE_CHECKPOINT_PATH \
--load $LOAD_CHECKPOINT_PATH \
--save-interval 5000 \
--tensorboard-dir $TENSORBOARD_PATH \
--tensorboard-log-interval 1 \
--checkpoint-activations \
--checkpoint-num-layers 1 \
--log-num-zeros-in-grad \
--log-params-norm \
--log-interval 100 \
--data-impl mmap \
--split 100,0,0 \
--distributed-backend nccl \
--lr 0.0001 \
--min-lr 1.0e-5 \
--lr-decay-style cosine \
--lr-decay-tokens 900000000 \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--lr-warmup-samples 52083 \
--fp16 \
--fp16-lm-cross-entropy"
OPTIONS="${OPTIONS} \
--deepspeed \
--deepspeed_config=${CONFIG_JSON} \
--zero-stage=${ZERO_STAGE} \
--deepspeed-activation-checkpointing \
"
deepspeed --num_gpus ${GPUS_PER_NODE} \
./pretrain_gpt.py "$@" ${OPTIONS}
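For context (my own arithmetic): TENSOR_PARALLEL=2 x PIPELINE_PARALLEL=3 covers all 6 GPUs, so the data-parallel size is 1, and DeepSpeed derives gradient_accumulation_steps = train_batch_size / (micro_batch_size x dp) = 2560 / (64 x 1) = 40.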
The gradient overflow occurred at step 227 and afterwards:
steps: 227 loss: 7.4369 iter time (s): 34.995 samples/sec: 73.153
[2022-02-15 18:51:10,812] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | pipe_send_output: 1271.62 | pipe_recv_grad: 558.71
[2022-02-15 18:51:45,705] [INFO] [stage_1_and_2.py:1644:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4096, reducing to 4096
[2022-02-15 18:51:45,706] [INFO] [logging.py:69:log_dist] [Rank 0] step=228, skipped=1, lr=[0.00030429999999999986, 0.00030429999999999986], mom=[(0.95822, 0.99), (0.95822, 0.99)]
[2022-02-15 18:51:45,706] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | batch_input: 169.80 | forward_microstep: 9296.47 | backward_microstep: 23461.88 | backward_inner_microstep: 23460.12 | backward_allreduce_microstep: 0.00 | step_microstep: 2.11
[2022-02-15 18:51:45,706] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 9298.00 | backward: 23461.01 | backward_inner: 23458.97 | backward_allreduce: 0.00 | step: 2.08
steps: 228 loss: nan iter time (s): 34.890 samples/sec: 73.373
[2022-02-15 18:51:45,706] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | pipe_send_output: 1259.52 | pipe_recv_grad: 565.64
[2022-02-15 18:52:20,687] [INFO] [stage_1_and_2.py:1644:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4096, reducing to 2048.0
[2022-02-15 18:52:20,687] [INFO] [logging.py:69:log_dist] [Rank 0] step=229, skipped=2, lr=[0.00030429999999999986, 0.00030429999999999986], mom=[(0.95822, 0.99), (0.95822, 0.99)]
[2022-02-15 18:52:20,687] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | batch_input: 151.81 | forward_microstep: 9293.15 | backward_microstep: 23479.71 | backward_inner_microstep: 23477.89 | backward_allreduce_microstep: 0.00 | step_microstep: 1.96
[2022-02-15 18:52:20,688] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 9294.62 | backward: 23478.80 | backward_inner: 23476.75 | backward_allreduce: 0.00 | step: 1.95
...
steps: 242 loss: nan iter time (s): 34.500 samples/sec: 74.202
[2022-02-15 18:59:50,599] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | pipe_send_output: 1244.14 | pipe_recv_grad: 592.68
[2022-02-15 19:00:25,037] [INFO] [stage_1_and_2.py:1644:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
[2022-02-15 19:00:25,038] [INFO] [logging.py:69:log_dist] [Rank 0] step=243, skipped=15, lr=[0.00030520000000000015, 0.00030520000000000015], mom=[(0.9580799999999999, 0.99), (0.9580799999999999, 0.99)]
[2022-02-15 19:00:25,038] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | batch_input: 158.33 | forward_microstep: 9217.78 | backward_microstep: 23053.91 | backward_inner_microstep: 23052.11 | backward_allreduce_microstep: 0.00 | step_microstep: 2.70
[2022-02-15 19:00:25,039] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 9219.42 | backward: 23052.99 | backward_inner: 23050.96 | backward_allreduce: 0.00 | step: 2.68
System info (please complete the following information):
- OS: Ubuntu 20.04
- GPU count and types: 8x NVIDIA RTX 3090
- Python version: 3.8
- CUDA: 11.0
Top GitHub Comments
I have only been using bf16 with https://github.com/bigscience-workshop/Megatron-DeepSpeed and it’s fantastic for huge models.
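If you want to try it, the change on the DeepSpeed side is, roughly, replacing the fp16 section of the config with a bf16 one (assuming a DeepSpeed version recent enough to have bf16 support):
"bf16": {
    "enabled": true
}
Note that bf16 needs hardware support; your RTX 3090s are Ampere (compute capability 8.6), so they have it.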
With the HF Trainer I haven't had a chance to do any serious training with bf16, so I don't know if it's solid or not, besides the grad accum caveat.
The format is superior to fp16 because of its much bigger dynamic range, and thus no overflows, but the various implementations can make it or break it. E.g. accumulating fp16 gradients in fp16 is fine since fp16 has relatively high precision, but for bf16 that is too lossy and will probably impact the training for the worse. So gradient accumulation really needs to be implemented in fp32 there. We should probably at least create an issue so that we know it needs to be done. And of course you're welcome to try to implement it; it shouldn't be too difficult, I think (though the devil is in the details).
The Deepspeed ZeRO implementation has the same issue at the moment https://github.com/microsoft/DeepSpeed/issues/1800
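To illustrate the fp32 accumulation idea, here is a toy sketch in plain PyTorch (not DeepSpeed's actual code; the fp32_grads buffer is made up for the example):
import torch

# Toy sketch: accumulate bf16 micro-batch gradients in fp32 buffers so
# that repeated small additions are not rounded away in bf16.
model = torch.nn.Linear(1024, 1024).to(torch.bfloat16)
fp32_grads = {p: torch.zeros_like(p, dtype=torch.float32)
              for p in model.parameters()}

for micro_step in range(8):  # 8 gradient-accumulation micro-steps
    x = torch.randn(16, 1024, dtype=torch.bfloat16)
    loss = model(x).float().pow(2).mean()
    loss.backward()
    for p in model.parameters():
        fp32_grads[p] += p.grad.float()  # upcast, then accumulate in fp32
        p.grad = None                    # discard the lossy bf16 gradient

# At step time the optimizer would consume fp32_grads (scaled by 1/num
# micro-steps) instead of p.grad.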
Have a look at this table: https://github.com/NVIDIA/Megatron-LM/blob/d50e89f1033ee1fedc5e61e98cb83b1ad043692b/README.md
It gives you a pretty good point of reference.
On the other hand, I don't think there is a hard rule or a no-no out there; we don't have enough LLM models to provide perfect recommendations. Ideally one would train the same param-count model under different ratios and compare the results, but that's super expensive and time-consuming, so we don't have that info.
And LLMs are somewhat different beasts from smaller models, so it's difficult to extrapolate conclusions from small models to large ones, though a lot of effort is being invested into figuring that out. And surely there are already some who can give you a more scientific answer.
Getting the hparams right is the holy grail of Machine Learning. Any of them could make it or break it, so you have to do a lot of trial and error until you get to know your model and your framework.
I already gave you a hint about std-init, and now, as you correctly intuited, you need to figure out a good lr warmup strategy.
Your model is ~5B params.
I think it may be a bit too deep, so you may want to make the hidden size a bit bigger and use fewer layers. You will have to experiment.
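For reference, the rough estimate I use for a decoder-only transformer is 12 * num_layers * hidden^2 parameters in the blocks; with your LAYERS=40 and HIDDEN=3072 that is 12 * 40 * 3072^2 ≈ 4.53B, plus embeddings, which lines up with the ~4.7B you mentioned.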
While we only have 1.3B and 13B around that size, feel free to try some hparam setups similar to ours: https://github.com/bigscience-workshop/bigscience/tree/master/train
(it's the .slurm scripts in the different sub-folders)
I trust that after some trial and error you will get it right, @observerw