[BUG] gradient overflow with fp16 enabled
Describe the bug
I was trying to use DeepSpeed fp16 mode to train a 4.7B GPT-2 model. At first it worked fine, but after about 220 steps the logs started to indicate an overflow and kept reducing the loss scale until it reached 1 and stayed there, and the loss value became NaN.
I tried the method mentioned in this answer: https://github.com/huggingface/transformers/issues/15570#issuecomment-1035306492, but after setting a higher initial scale power (I tried 18, 24, and 32), the overflow was present even at the first step:
steps: 1 loss: 11.4926 iter time (s): 36.217 samples/sec: 70.686
[2022-02-16 01:22:29,596] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | pipe_send_output: 4606.68 | pipe_recv_grad: 753.47
[2022-02-16 01:23:00,852] [INFO] [stage_1_and_2.py:1644:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648.0
[2022-02-16 01:23:00,853] [INFO] [logging.py:69:log_dist] [Rank 0] step=2, skipped=2, lr=[0.0001, 0.0001], mom=[(0.85, 0.99), (0.85, 0.99)]
[2022-02-16 01:23:00,853] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | batch_input: 325.46 | forward_microstep: 8068.21 | backward_microstep: 20297.51 | backward_inner_microstep: 20295.81 | backward_allreduce_microstep: 0.00 | step_microstep: 2.04
[2022-02-16 01:23:00,853] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 8069.56 | backward: 20296.67 | backward_inner: 20294.76 | backward_allreduce: 0.00 | step: 2.03
steps: 2 loss: 11.4975 iter time (s): 30.870 samples/sec: 82.929
[2022-02-16 01:23:00,853] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | pipe_send_output: 1377.88 | pipe_recv_grad: 663.08
[2022-02-16 01:23:31,730] [INFO] [stage_1_and_2.py:1644:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648.0, reducing to 1073741824.0
[2022-02-16 01:23:31,731] [INFO] [logging.py:69:log_dist] [Rank 0] step=3, skipped=3, lr=[0.0001, 0.0001], mom=[(0.85, 0.99), (0.85, 0.99)]
[2022-02-16 01:23:31,731] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | batch_input: 294.01 | forward_microstep: 8067.96 | backward_microstep: 20333.02 | backward_inner_microstep: 20331.33 | backward_allreduce_microstep: 0.00 | step_microstep: 2.23
[2022-02-16 01:23:31,731] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 8068.90 | backward: 20332.18 | backward_inner: 20330.27 | backward_allreduce: 0.00 | step: 2.21
I also tried this: https://github.com/microsoft/DeepSpeed/issues/697#issuecomment-767874226, but that method instead leads to a different error:
AssertionError: data parallel group is not initialized
So now I don’t have any idea how to solve this problem.
To Reproduce
Here's my training script:
#! /bin/bash
GPUS_PER_NODE=6
DATETIME=`date +'date_%y-%m-%d_time_%H-%M-%S'`
LOAD_CHECKPOINT_PATH=/some_path
SAVE_CHECKPOINT_PATH=/some_path
TENSORBOARD_PATH=/workspace/tensorboard/$DATETIME
VOCAB_FILE=vocab.txt
DATA_PATH=$(cat dataset.txt)
CONFIG_JSON=deepspeed_config.json
TENSOR_PARALLEL=2
PIPELINE_PARALLEL=3
HIDDEN=3072
ATTENTION_HEADS=24
LAYERS=40
SEQ=2048
GLOBAL_BATCH=2560
MICRO_BATCH=64
TOKENS=1000000000
ZERO_STAGE=1
cat <<EOT > $CONFIG_JSON
{
"train_batch_size" : $GLOBAL_BATCH,
"train_micro_batch_size_per_gpu": $MICRO_BATCH,
"steps_per_print": 1,
"wall_clock_breakdown": true,
"gradient_clipping": 1.0,
"prescale_gradients": false,
"optimizer": {
"type": "OneBitLamb",
"params": {
"lr": 11e-3,
"max_coeff": 0.3,
"min_coeff": 0.01,
"freeze_step": 1000,
"cuda_aware": false,
"comm_backend_name": "nccl",
"coeff_beta": 0.9,
"factor_max": 4.0,
"factor_min": 0.5,
"factor_threshold": 0.1
}
},
"scheduler": {
"type": "OneCycle",
"params": {
"cycle_first_step_size": 1000,
"cycle_first_stair_count": 500,
"cycle_second_step_size": 1000,
"cycle_second_stair_count": 500,
"decay_step_size": 1000,
"cycle_min_lr": 0.0001,
"cycle_max_lr": 0.0010,
"decay_lr_rate": 0.001,
"cycle_min_mom": 0.85,
"cycle_max_mom": 0.99,
"decay_mom_rate": 0.0
}
},
"aio": {
"thread_count": 8,
"single_submit": true
},
"zero_optimization": {
"stage": $ZERO_STAGE,
"contiguous_gradients": true,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 5e8,
"allgather_bucket_size": 5e8
},
"activation_checkpointing": {
"partition_activations": true,
"cpu_checkpointing": true,
"profile": true
},
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 500,
"hysteresis": 2,
"min_loss_scale": 1,
"initial_scale_power": 12
},
"curriculum_learning": {
"enabled": true,
"curriculum_type": "seqlen",
"schedule_type": "fixed_linear",
"min_difficulty": 64,
"max_difficulty": 1024,
"schedule_config": {
"total_curriculum_step": 15000,
"difficulty_step": 8
}
}
}
EOT
OPTIONS="--tokenizer-type EncDecTokenizer \
--vocab-file $VOCAB_FILE \
--tensor-model-parallel-size $TENSOR_PARALLEL \
--pipeline-model-parallel-size $PIPELINE_PARALLEL \
--num-layers $LAYERS \
--hidden-size $HIDDEN \
--num-attention-heads $ATTENTION_HEADS \
--seq-length $SEQ \
--max-position-embeddings $SEQ \
--micro-batch-size $MICRO_BATCH \
--global-batch-size $GLOBAL_BATCH \
--train-samples 1000000000 \
--train-tokens $TOKENS \
--data-path $DATA_PATH \
--save $SAVE_CHECKPOINT_PATH \
--load $LOAD_CHECKPOINT_PATH \
--save-interval 5000 \
--tensorboard-dir $TENSORBOARD_PATH \
--tensorboard-log-interval 1 \
--checkpoint-activations \
--checkpoint-num-layers 1 \
--log-num-zeros-in-grad \
--log-params-norm \
--log-interval 100 \
--data-impl mmap \
--split 100,0,0 \
--distributed-backend nccl \
--lr 0.0001 \
--min-lr 1.0e-5 \
--lr-decay-style cosine \
--lr-decay-tokens 900000000 \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--lr-warmup-samples 52083 \
--fp16 \
--fp16-lm-cross-entropy"
OPTIONS="${OPTIONS} \
--deepspeed \
--deepspeed_config=${CONFIG_JSON} \
--zero-stage=${ZERO_STAGE} \
--deepspeed-activation-checkpointing \
"
deepspeed --num_gpus ${GPUS_PER_NODE} \
./pretrain_gpt.py "$@" ${OPTIONS}
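For context (my own arithmetic): TENSOR_PARALLEL=2 x PIPELINE_PARALLEL=3 covers all 6 GPUs, so the data-parallel size is 1, and DeepSpeed derives gradient_accumulation_steps = train_batch_size / (micro_batch_size x dp) = 2560 / (64 x 1) = 40.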
The gradient overflow occurred at step 227 and afterwards:
steps: 227 loss: 7.4369 iter time (s): 34.995 samples/sec: 73.153
[2022-02-15 18:51:10,812] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | pipe_send_output: 1271.62 | pipe_recv_grad: 558.71
[2022-02-15 18:51:45,705] [INFO] [stage_1_and_2.py:1644:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4096, reducing to 4096
[2022-02-15 18:51:45,706] [INFO] [logging.py:69:log_dist] [Rank 0] step=228, skipped=1, lr=[0.00030429999999999986, 0.00030429999999999986], mom=[(0.95822, 0.99), (0.95822, 0.99)]
[2022-02-15 18:51:45,706] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | batch_input: 169.80 | forward_microstep: 9296.47 | backward_microstep: 23461.88 | backward_inner_microstep: 23460.12 | backward_allreduce_microstep: 0.00 | step_microstep: 2.11
[2022-02-15 18:51:45,706] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 9298.00 | backward: 23461.01 | backward_inner: 23458.97 | backward_allreduce: 0.00 | step: 2.08
steps: 228 loss: nan iter time (s): 34.890 samples/sec: 73.373
[2022-02-15 18:51:45,706] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | pipe_send_output: 1259.52 | pipe_recv_grad: 565.64
[2022-02-15 18:52:20,687] [INFO] [stage_1_and_2.py:1644:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4096, reducing to 2048.0
[2022-02-15 18:52:20,687] [INFO] [logging.py:69:log_dist] [Rank 0] step=229, skipped=2, lr=[0.00030429999999999986, 0.00030429999999999986], mom=[(0.95822, 0.99), (0.95822, 0.99)]
[2022-02-15 18:52:20,687] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | batch_input: 151.81 | forward_microstep: 9293.15 | backward_microstep: 23479.71 | backward_inner_microstep: 23477.89 | backward_allreduce_microstep: 0.00 | step_microstep: 1.96
[2022-02-15 18:52:20,688] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 9294.62 | backward: 23478.80 | backward_inner: 23476.75 | backward_allreduce: 0.00 | step: 1.95
...
steps: 242 loss: nan iter time (s): 34.500 samples/sec: 74.202
[2022-02-15 18:59:50,599] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | pipe_send_output: 1244.14 | pipe_recv_grad: 592.68
[2022-02-15 19:00:25,037] [INFO] [stage_1_and_2.py:1644:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
[2022-02-15 19:00:25,038] [INFO] [logging.py:69:log_dist] [Rank 0] step=243, skipped=15, lr=[0.00030520000000000015, 0.00030520000000000015], mom=[(0.9580799999999999, 0.99), (0.9580799999999999, 0.99)]
[2022-02-15 19:00:25,038] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | batch_input: 158.33 | forward_microstep: 9217.78 | backward_microstep: 23053.91 | backward_inner_microstep: 23052.11 | backward_allreduce_microstep: 0.00 | step_microstep: 2.70
[2022-02-15 19:00:25,039] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 9219.42 | backward: 23052.99 | backward_inner: 23050.96 | backward_allreduce: 0.00 | step: 2.68
System info (please complete the following information):
- OS: Ubuntu 20.04
- GPU count and types: 8x NVIDIA RTX 3090
- Python version: 3.8
- CUDA: 11.0
Top GitHub Comments
I have only been using bf16 with https://github.com/bigscience-workshop/Megatron-DeepSpeed and it’s fantastic for huge models.
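If you want to try it, the change on the DeepSpeed side is, roughly, replacing the fp16 section of the config with a bf16 one (assuming a DeepSpeed version recent enough to have bf16 support):
"bf16": {
    "enabled": true
}
Note that bf16 needs hardware support; your RTX 3090s are Ampere (compute capability 8.6), so they have it.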
With the HF Trainer I haven't had a chance to do any serious training with bf16, so I don't know if it's solid or not, besides the grad accum caveat.
The format is superior to fp16 because of its much bigger dynamic range, and thus no overflows, but the various implementations can make it or break it. E.g. accumulating fp16 gradients in fp16 is fine since fp16 has relatively high precision, but for bf16 that is too lossy and will probably impact the training for the worse. So gradient accumulation really needs to be implemented in fp32 there. We should probably at least create an issue so that we know it needs to be done. And of course you're welcome to try to implement it; it shouldn't be too difficult, I think (though the devil is in the details).
The Deepspeed ZeRO implementation has the same issue at the moment https://github.com/microsoft/DeepSpeed/issues/1800
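To illustrate the fp32 accumulation idea, here is a toy sketch in plain PyTorch (not DeepSpeed's actual code; the fp32_grads buffer is made up for the example):
import torch

# Toy sketch: accumulate bf16 micro-batch gradients in fp32 buffers so
# that repeated small additions are not rounded away in bf16.
model = torch.nn.Linear(1024, 1024).to(torch.bfloat16)
fp32_grads = {p: torch.zeros_like(p, dtype=torch.float32)
              for p in model.parameters()}

for micro_step in range(8):  # 8 gradient-accumulation micro-steps
    x = torch.randn(16, 1024, dtype=torch.bfloat16)
    loss = model(x).float().pow(2).mean()
    loss.backward()
    for p in model.parameters():
        fp32_grads[p] += p.grad.float()  # upcast, then accumulate in fp32
        p.grad = None                    # discard the lossy bf16 gradient

# At step time the optimizer would consume fp32_grads (scaled by 1/num
# micro-steps) instead of p.grad.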
Have a look at this table: https://github.com/NVIDIA/Megatron-LM/blob/d50e89f1033ee1fedc5e61e98cb83b1ad043692b/README.md
It gives you a pretty good point of reference.
On the other hand, I don't think there is a hard rule or a no-no out there; we don't have enough LLM models to provide perfect recommendations. Ideally one would train the same param-count model under different ratios and compare the results, but that's super expensive and time-consuming, so we don't have that info.
And LLMs are somewhat different beasts from smaller models, so it's difficult to extrapolate conclusions from small models to large ones, though a lot of effort is being invested into figuring that out. And surely there are already some who can give you a more scientific answer.
Getting the hparams right is the holy grail of Machine Learning. Any of them could make it or break it, so you have to do a lot of trial and error until you get to know your model and your framework.
I already gave you a hint about std-init, and now, as you correctly intuited, you need to figure out a good lr warmup strategy.
Your model is ~5B params.
I think it may be a bit too deep, so you may want to make the hidden size a bit bigger and use fewer layers. You will have to experiment.
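For reference, the rough estimate I use for a decoder-only transformer is 12 * num_layers * hidden^2 parameters in the blocks; with your LAYERS=40 and HIDDEN=3072 that is 12 * 40 * 3072^2 ≈ 4.53B, plus embeddings, which lines up with the ~4.7B you mentioned.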
While we only have 1.3B and 13B around that size, feel free to try some hparam setups similar to ours: https://github.com/bigscience-workshop/bigscience/tree/master/train
(it's the .slurm scripts in the different sub-folders)
I trust that after some trial and error you will get it right, @observerw