[BUG] gradient overflow with fp16 enabled

Describe the bug

I was trying to use DeepSpeed's fp16 mode to train a 4.7B GPT-2 model. At first it worked fine, but after about 220 steps the logs start to report an overflow and keep reducing the loss scale until it reaches 1, at which point the overflow message just repeats and the loss value becomes NaN.

I tried the method mentioned in this answer: https://github.com/huggingface/transformers/issues/15570#issuecomment-1035306492, but after setting a higher initial scale power (I tried 18, 24, and 32), the overflow now shows up from the very first step:

steps: 1 loss: 11.4926 iter time (s): 36.217 samples/sec: 70.686
[2022-02-16 01:22:29,596] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | pipe_send_output: 4606.68 | pipe_recv_grad: 753.47
[2022-02-16 01:23:00,852] [INFO] [stage_1_and_2.py:1644:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648.0
[2022-02-16 01:23:00,853] [INFO] [logging.py:69:log_dist] [Rank 0] step=2, skipped=2, lr=[0.0001, 0.0001], mom=[(0.85, 0.99), (0.85, 0.99)]
[2022-02-16 01:23:00,853] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | batch_input: 325.46 | forward_microstep: 8068.21 | backward_microstep: 20297.51 | backward_inner_microstep: 20295.81 | backward_allreduce_microstep: 0.00 | step_microstep: 2.04
[2022-02-16 01:23:00,853] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 8069.56 | backward: 20296.67 | backward_inner: 20294.76 | backward_allreduce: 0.00 | step: 2.03
steps: 2 loss: 11.4975 iter time (s): 30.870 samples/sec: 82.929
[2022-02-16 01:23:00,853] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | pipe_send_output: 1377.88 | pipe_recv_grad: 663.08
[2022-02-16 01:23:31,730] [INFO] [stage_1_and_2.py:1644:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648.0, reducing to 1073741824.0
[2022-02-16 01:23:31,731] [INFO] [logging.py:69:log_dist] [Rank 0] step=3, skipped=3, lr=[0.0001, 0.0001], mom=[(0.85, 0.99), (0.85, 0.99)]
[2022-02-16 01:23:31,731] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | batch_input: 294.01 | forward_microstep: 8067.96 | backward_microstep: 20333.02 | backward_inner_microstep: 20331.33 | backward_allreduce_microstep: 0.00 | step_microstep: 2.23
[2022-02-16 01:23:31,731] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 8068.90 | backward: 20332.18 | backward_inner: 20330.27 | backward_allreduce: 0.00 | step: 2.21
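
(For context: the "4294967296" above is 2**32, i.e. the run with initial_scale_power=32. Below is a minimal sketch of the dynamic loss scaling policy those log lines describe, assuming the usual halve-on-overflow / grow-after-a-clean-window behaviour and ignoring the hysteresis setting for simplicity; it is illustrative only, not DeepSpeed's actual implementation.)

class DynamicLossScalerSketch:
    """Illustrative model of dynamic loss scaling, not DeepSpeed internals."""

    def __init__(self, initial_scale_power=32, loss_scale_window=500, min_loss_scale=1.0):
        self.scale = 2.0 ** initial_scale_power    # e.g. 2**32 = 4294967296
        self.window = loss_scale_window            # clean steps required before growing
        self.min_scale = min_loss_scale
        self.clean_steps = 0

    def update(self, found_overflow: bool) -> bool:
        """Return True if the optimizer step should be skipped."""
        if found_overflow:
            # Skip the step and halve the scale, but never go below min_loss_scale.
            self.scale = max(self.scale / 2.0, self.min_scale)
            self.clean_steps = 0
            return True
        self.clean_steps += 1
        if self.clean_steps % self.window == 0:
            self.scale *= 2.0                      # grow back after a clean window
        return False

Under a policy like this, the config's initial_scale_power of 12 starts the scale at 4096, so roughly a dozen consecutive overflowed steps are enough to drive it down to min_loss_scale=1, which is what the step 227-242 excerpt further below shows.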

I also tried this: https://github.com/microsoft/DeepSpeed/issues/697#issuecomment-767874226, but that only made things worse; this method leads to a different error:

AssertionError: data parallel group is not initialized

So now I don’t have any idea how to solve this problem.

To Reproduce

Here's my training script:

#! /bin/bash

GPUS_PER_NODE=6

DATETIME=`date +'date_%y-%m-%d_time_%H-%M-%S'`
LOAD_CHECKPOINT_PATH=/some_path
SAVE_CHECKPOINT_PATH=/some_path
TENSORBOARD_PATH=/workspace/tensorboard/$DATETIME

VOCAB_FILE=vocab.txt
DATA_PATH=$(cat dataset.txt)
CONFIG_JSON=deepspeed_config.json

TENSOR_PARALLEL=2
PIPELINE_PARALLEL=3
HIDDEN=3072
ATTENTION_HEADS=24
LAYERS=40
SEQ=2048
GLOBAL_BATCH=2560
MICRO_BATCH=64
TOKENS=1000000000
ZERO_STAGE=1

cat <<EOT > $CONFIG_JSON
{
  "train_batch_size" : $GLOBAL_BATCH,
  "train_micro_batch_size_per_gpu": $MICRO_BATCH,

  "steps_per_print": 1,
  "wall_clock_breakdown": true,

  "gradient_clipping": 1.0,
  "prescale_gradients": false,

  "optimizer": {
    "type": "OneBitLamb",
    "params": {
      "lr": 11e-3,
      "max_coeff": 0.3,
      "min_coeff": 0.01,
      "freeze_step": 1000,
      "cuda_aware": false,
      "comm_backend_name": "nccl",
      "coeff_beta": 0.9,
      "factor_max": 4.0,
      "factor_min": 0.5,
      "factor_threshold": 0.1
    }
  },

  "scheduler": {
    "type": "OneCycle",
    "params": {
        "cycle_first_step_size": 1000,
        "cycle_first_stair_count": 500,
        "cycle_second_step_size": 1000,
        "cycle_second_stair_count": 500,
        "decay_step_size": 1000,
        "cycle_min_lr": 0.0001,
        "cycle_max_lr": 0.0010,
        "decay_lr_rate": 0.001,
        "cycle_min_mom": 0.85,
        "cycle_max_mom": 0.99,
        "decay_mom_rate": 0.0
    }
  },

  "aio": {
    "thread_count": 8,
    "single_submit": true
  },

  "zero_optimization": {
    "stage": $ZERO_STAGE,
    "contiguous_gradients": true,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "allgather_bucket_size": 5e8
  },

  "activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": true,
    "profile": true
  },

  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 500,
    "hysteresis": 2,
    "min_loss_scale": 1,
    "initial_scale_power": 12
  },

  "curriculum_learning": {
    "enabled": true,
    "curriculum_type": "seqlen",
    "schedule_type": "fixed_linear",
    "min_difficulty": 64,
    "max_difficulty": 1024,
    "schedule_config": {
      "total_curriculum_step": 15000,
      "difficulty_step": 8
    }
  }
}
EOT

OPTIONS="--tokenizer-type EncDecTokenizer \
        --vocab-file $VOCAB_FILE \
        --tensor-model-parallel-size $TENSOR_PARALLEL \
        --pipeline-model-parallel-size $PIPELINE_PARALLEL \
        --num-layers $LAYERS \
        --hidden-size $HIDDEN \
        --num-attention-heads $ATTENTION_HEADS \
        --seq-length $SEQ \
        --max-position-embeddings $SEQ \
        --micro-batch-size $MICRO_BATCH \
        --global-batch-size $GLOBAL_BATCH \
        --train-samples 1000000000 \
        --train-tokens $TOKENS \
        --data-path $DATA_PATH \
        --save $SAVE_CHECKPOINT_PATH \
        --load $LOAD_CHECKPOINT_PATH \
        --save-interval 5000 \
        --tensorboard-dir $TENSORBOARD_PATH \
        --tensorboard-log-interval 1 \
        --checkpoint-activations \
        --checkpoint-num-layers 1 \
        --log-num-zeros-in-grad \
        --log-params-norm \
        --log-interval 100 \
        --data-impl mmap \
        --split 100,0,0 \
        --distributed-backend nccl \
        --lr 0.0001 \
        --min-lr 1.0e-5 \
        --lr-decay-style cosine \
        --lr-decay-tokens 900000000 \
        --weight-decay 1e-2 \
        --clip-grad 1.0 \
        --lr-warmup-samples 52083 \
        --fp16 \
        --fp16-lm-cross-entropy"

OPTIONS="${OPTIONS} \
		--deepspeed \
		--deepspeed_config=${CONFIG_JSON} \
		--zero-stage=${ZERO_STAGE} \
		--deepspeed-activation-checkpointing \
	"

deepspeed --num_gpus ${GPUS_PER_NODE} \
          ./pretrain_gpt.py $@ ${OPTIONS}

The gradient overflow occurred at step 227 and afterwards:

steps: 227 loss: 7.4369 iter time (s): 34.995 samples/sec: 73.153
[2022-02-15 18:51:10,812] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | pipe_send_output: 1271.62 | pipe_recv_grad: 558.71
[2022-02-15 18:51:45,705] [INFO] [stage_1_and_2.py:1644:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4096, reducing to 4096
[2022-02-15 18:51:45,706] [INFO] [logging.py:69:log_dist] [Rank 0] step=228, skipped=1, lr=[0.00030429999999999986, 0.00030429999999999986], mom=[(0.95822, 0.99), (0.95822, 0.99)]
[2022-02-15 18:51:45,706] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | batch_input: 169.80 | forward_microstep: 9296.47 | backward_microstep: 23461.88 | backward_inner_microstep: 23460.12 | backward_allreduce_microstep: 0.00 | step_microstep: 2.11
[2022-02-15 18:51:45,706] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 9298.00 | backward: 23461.01 | backward_inner: 23458.97 | backward_allreduce: 0.00 | step: 2.08
steps: 228 loss: nan iter time (s): 34.890 samples/sec: 73.373
[2022-02-15 18:51:45,706] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | pipe_send_output: 1259.52 | pipe_recv_grad: 565.64
[2022-02-15 18:52:20,687] [INFO] [stage_1_and_2.py:1644:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4096, reducing to 2048.0
[2022-02-15 18:52:20,687] [INFO] [logging.py:69:log_dist] [Rank 0] step=229, skipped=2, lr=[0.00030429999999999986, 0.00030429999999999986], mom=[(0.95822, 0.99), (0.95822, 0.99)]
[2022-02-15 18:52:20,687] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | batch_input: 151.81 | forward_microstep: 9293.15 | backward_microstep: 23479.71 | backward_inner_microstep: 23477.89 | backward_allreduce_microstep: 0.00 | step_microstep: 1.96
[2022-02-15 18:52:20,688] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 9294.62 | backward: 23478.80 | backward_inner: 23476.75 | backward_allreduce: 0.00 | step: 1.95

...

steps: 242 loss: nan iter time (s): 34.500 samples/sec: 74.202
[2022-02-15 18:59:50,599] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | pipe_send_output: 1244.14 | pipe_recv_grad: 592.68
[2022-02-15 19:00:25,037] [INFO] [stage_1_and_2.py:1644:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
[2022-02-15 19:00:25,038] [INFO] [logging.py:69:log_dist] [Rank 0] step=243, skipped=15, lr=[0.00030520000000000015, 0.00030520000000000015], mom=[(0.9580799999999999, 0.99), (0.9580799999999999, 0.99)]
[2022-02-15 19:00:25,038] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | batch_input: 158.33 | forward_microstep: 9217.78 | backward_microstep: 23053.91 | backward_inner_microstep: 23052.11 | backward_allreduce_microstep: 0.00 | step_microstep: 2.70
[2022-02-15 19:00:25,039] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 9219.42 | backward: 23052.99 | backward_inner: 23050.96 | backward_allreduce: 0.00 | step: 2.68

System info:

  • OS: Ubuntu 20.04
  • GPU count and types: 8x NVIDIA 3090
  • Python version: 3.8
  • CUDA: 11.0

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

3 reactions
stas00 commented on Jun 27, 2022

bf16

I have only been using bf16 with https://github.com/bigscience-workshop/Megatron-DeepSpeed and it’s fantastic for huge models.

With the HF Trainer I haven't had a chance to do any serious training with bf16, so I don't know whether it's solid or not, aside from the grad accum caveat.

The format is superior to fp16 because of its much bigger dynamic range and thus no overflows; that said, the various implementations can make it or break it. For example, grad accumulation in fp16 is fine since fp16 has relatively high precision, but in bf16 it is too lossy and will probably impact the training for the worse. So grad accumulation really needs to be implemented in fp32 there. We should probably at least create an issue so that we know it needs to be done. And of course you're welcome to try to implement it; it shouldn't be too difficult, I think (though the devil is in the details).

The Deepspeed ZeRO implementation has the same issue at the moment https://github.com/microsoft/DeepSpeed/issues/1800
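
To make the precision point above concrete, here is a small illustrative experiment (assumptions: arbitrary tensor size and 64 accumulation steps, plain PyTorch, nothing DeepSpeed-specific) that accumulates micro-batch gradients for one tensor in fp16, bf16, and fp32 and compares each running sum against an fp64 reference; bf16's shorter mantissa makes its accumulation error markedly larger than fp16's, while fp32 accumulation is essentially exact:

import torch

# Illustrative only: accumulate 64 micro-batch "gradients" for a single
# parameter tensor in different dtypes and compare to an fp64 reference.
torch.manual_seed(0)
accum_steps = 64
micro_grads = [torch.randn(10_000, dtype=torch.float64) for _ in range(accum_steps)]
reference = torch.stack(micro_grads).sum(dim=0)

def accumulate(dtype):
    acc = torch.zeros(10_000, dtype=dtype)
    for g in micro_grads:
        acc += g.to(dtype)          # running sum kept in the accumulation dtype
    return acc.to(torch.float64)

for dtype in (torch.float16, torch.bfloat16, torch.float32):
    rel_err = (accumulate(dtype) - reference).norm() / reference.norm()
    print(f"{str(dtype):>15}: relative accumulation error ~ {rel_err.item():.1e}")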

Hidden to layers ratio

Have a look at this table: https://github.com/NVIDIA/Megatron-LM/blob/d50e89f1033ee1fedc5e61e98cb83b1ad043692b/README.md

It gives you a pretty good point of reference.

On the other hand, I don't think there is a hard rule or a no-no out there; we don't have enough LLM models to provide perfect recommendations. Ideally one would train the same param-count model under different ratios and compare the results, but that's super expensive and time-consuming, so we don't have that info.

And LLMs are somewhat different beasts from smaller models, so it's difficult to extrapolate conclusions from small models to large ones, though a lot of effort is being invested into figuring that out. And surely there are already some who can give you a more scientific answer.

1 reaction
stas00 commented on Jun 27, 2022

Getting the hparams right is the holy grail of Machine Learning. Any of them could make it or break it, so you have to do a lot of trial and error until you get to know your model and your framework.

I already gave you a hint about std-init, and now, as you correctly intuited, you need to figure out a good lr warmup strategy.

Your model is 5B params

NHIDDEN=3072; NLAYERS=40; SEQ_LEN=2048; VOCAB_SIZE=50257; \
python -c "h=$NHIDDEN; l=$NLAYERS; s=$SEQ_LEN; v=$VOCAB_SIZE; \
print(f'Model size: {(l*(12*h**2 + 13*h) + v*h + s*h + 2*h) / 10**9 :.0f}B, ratio={int(h/l)}')"
Model size: 5B, ratio=76
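
(Quick sanity check of that number using the same formula: the dominant term 12*l*h**2 = 12*40*3072**2 ≈ 4.53B, the embedding term v*h = 50257*3072 ≈ 0.15B, and the remaining terms (13*l*h, s*h, 2*h) add less than 0.01B, giving roughly 4.69B parameters in total. That matches the "4.7B" figure in the original report; the one-liner just rounds it to 5B.)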

I think it may be a bit too deep, so you may want to make the hidden size a bit bigger and use fewer layers. You will have to experiment.

While we only have 1.3B and 13B models around that size, feel free to try some hparam setups similar to ours: https://github.com/bigscience-workshop/bigscience/tree/master/train

(it’s the .slurm scripts in the different sub-folders)

I trust that after some trial and error you will get it right, @observerw
