FP16 overflow with GPT-Neo when using sequence lengths of 2048.
Environment info
- transformers version: 4.5.0.dev0
- Platform: Linux-5.4.0-54-generic-x86_64-with-glibc2.29
- Python version: 3.8.5
- PyTorch version (GPU?): 1.8.0+cu111
- Tensorflow version (GPU?): N/A
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
Who can help
Models:
- GPT-Neo 1.3b
Library:
- deepspeed: @stas00
Information
Model I am using (Bert, XLNet …): GPT-Neo 1.3B
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
- Use GPT-Neo 1.3B with The Pile dataset and the built-in Trainer. Artificial data also suffices; it does not matter what the data is, as long as the attention mask spans all 2048 tokens.
- Enable FP16 and set max_length to 2048.
- Observe that all reported losses are NaN (a minimal sketch follows below).
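A minimal sketch of the kind of script that triggers this, assuming native AMP stands in for the Trainer's fp16 flag and random token IDs stand in for The Pile:

```python
import torch
from transformers import GPTNeoForCausalLM

model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B").cuda()
model.train()

# Random token IDs stand in for The Pile; the content does not matter as long
# as the attention mask covers all 2048 positions.
input_ids = torch.randint(0, model.config.vocab_size, (1, 2048), device="cuda")
attention_mask = torch.ones_like(input_ids)

# fp16 mixed precision via native AMP (the Trainer's fp16=True path is analogous)
with torch.cuda.amp.autocast():
    out = model(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids)

print(out.loss)  # NaN at sequence length 2048; finite at 512
```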
This is also reproducible using AMP or DeepSpeed. There already seems to be code in the GPT-Neo implementation intended to circumvent this, where q, k, v are cast to fp32 in the attention block.
When the max_length is shorter (512) this overflow does not occur.
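For reference, the kind of local upcast referred to above looks roughly like this (an illustrative sketch of the pattern, not a copy of the transformers code):

```python
import torch

def attn_weights_fp32(query, key, causal_mask, mask_value=-1e9):
    # Sketch of the "do attention math in fp32" pattern: upcast before the
    # matmul/softmax so the scores cannot overflow fp16, then cast back.
    q = query.to(torch.float32)
    k = key.to(torch.float32)
    scores = torch.matmul(q, k.transpose(-1, -2))
    scores = torch.where(
        causal_mask, scores,
        torch.tensor(mask_value, dtype=scores.dtype, device=scores.device),
    )
    probs = torch.nn.functional.softmax(scores, dim=-1)
    return probs.to(query.dtype)
```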
Expected behavior
I expected no overflows.
Aside
I’m reaching out on behalf of EleutherAI; Lysandre told us to create an issue about this.
Top GitHub Comments
In general, if you want users to be able to use fp16 mixed precision for fine-tuning and inference, you need to pre-train the model in that mode. For some models we find workarounds that locally switch to fp32 for the specific submodules that underflow/overflow under fp16, but users often still get NaNs during long training.
Bottom line: if you pre-train in bf16, be prepared to tell users to use fp32 or bf16 in their fine-tuning/inference processes. As new hardware supporting the bf16/tf32 formats emerges (rtx-3090 + a100), this will become the simple go-to solution in the future.
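The root of the mismatch is the dynamic range of the two formats; a quick check in plain PyTorch (nothing model-specific):

```python
import torch

# bf16 keeps fp32's exponent range but with fewer mantissa bits, while fp16
# tops out around 6.5e4, so bf16-pretrained activations can overflow fp16.
print(torch.finfo(torch.float16).max)    # 65504.0
print(torch.finfo(torch.bfloat16).max)   # ~3.39e38
print(torch.finfo(torch.float32).max)    # ~3.40e38
```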
Now that DeepSpeed will have a full-fp32 mode, this is great.
So to summarize, at this moment with Samyam’s branch: if you set `fp16.enabled=false` in the ds config, then `zero.Init(dtype=torch.float)` is needed in `modeling_utils.py` (instead of just `zero.Init()`) - I need to think how to make that configurable.
I’m asking the DeepSpeed devs if they have some ideas on how to overcome this; I will keep you posted if we find a good intermediary solution.
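A rough sketch of the combination being described (not the actual modeling_utils.py patch; the config dict, ZeRO stage, and model name are illustrative):

```python
import torch
import deepspeed
from transformers import GPTNeoForCausalLM

# fp16 disabled in the DeepSpeed config -> full-fp32 mode.
ds_config = {
    "fp16": {"enabled": False},
    "zero_optimization": {"stage": 3},   # assumed: zero.Init is the ZeRO-3 init path
}
# In the HF integration this dict would live in the JSON file passed via --deepspeed.

# The ZeRO init context then has to be told explicitly to keep params in fp32.
with deepspeed.zero.Init(dtype=torch.float):   # instead of plain zero.Init()
    model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
```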
But at the very least we now know why the model fails under fp16.
I wonder if pre-training processes targeted for mixed precision use should have a loss penalty component that forces the model to remain within fp16 dynamic range, both upper and lower.
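A toy rendering of that idea (the thresholds, weighting, and choice of which activations to penalize are all assumptions, not an established recipe):

```python
import torch

def fp16_range_penalty(hidden_states, upper=6.0e4, lower=6.0e-5, weight=1e-4):
    # Illustrative only: penalize magnitudes approaching fp16's max (~65504)
    # or, for nonzero values, falling below its smallest normal (~6.1e-5).
    absval = hidden_states.abs()
    over = torch.relu(absval - upper).mean()
    nonzero = absval[absval > 0]
    under = torch.relu(lower - nonzero).mean() if nonzero.numel() else absval.new_zeros(())
    return weight * (over + under)

# During pre-training, something like:
#   loss = lm_loss + fp16_range_penalty(hidden_states)
```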