
FP16 overflow with GPT-Neo when using sequence lengths of 2048.


Environment info

  • transformers version: 4.5.0.dev0
  • Platform: Linux-5.4.0-54-generic-x86_64-with-glibc2.29
  • Python version: 3.8.5
  • PyTorch version (GPU?): 1.8.0+cu111
  • Tensorflow version (GPU?): N/A
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help

@stas00

Models:

  • GPT-Neo 1.3b

Information

Model I am using: GPT-Neo 1.3B

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. Use GPT-Neo 1.3B with The Pile dataset and the built-in Trainer. Artificial data also suffices; it does not matter what the data is, as long as the attention mask spans all 2048 tokens.
  2. Enable FP16 and set max_length to 2048.
  3. Observe that all losses reported are NaN (a minimal reproduction sketch follows this list).
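
A minimal sketch of those steps, assuming a single CUDA GPU and the transformers Trainer. The random-token dataset below is an illustrative stand-in for The Pile; apart from the EleutherAI/gpt-neo-1.3B checkpoint, every name is made up.

```python
# Minimal reproduction sketch: fp16 training on full 2048-token sequences.
import torch
from torch.utils.data import Dataset
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments


class RandomTokens(Dataset):
    """Artificial causal-LM data: random ids with an attention mask over all 2048 positions."""

    def __init__(self, seq_len=2048, vocab_size=50257, size=64):
        self.seq_len, self.vocab_size, self.size = seq_len, vocab_size, size

    def __len__(self):
        return self.size

    def __getitem__(self, idx):
        ids = torch.randint(0, self.vocab_size, (self.seq_len,))
        return {"input_ids": ids, "attention_mask": torch.ones_like(ids), "labels": ids}


model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")

args = TrainingArguments(
    output_dir="gpt-neo-fp16-repro",
    per_device_train_batch_size=1,
    fp16=True,        # step 2: enable mixed precision
    max_steps=10,
    logging_steps=1,  # step 3: every logged loss comes back as NaN
)

Trainer(model=model, args=args, train_dataset=RandomTokens()).train()
```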

Also reproducible using AMP or DeepSpeed. It seems there is code intended to circumvent this in the GPT-Neo implementation, where q, k, v are cast to fp32 in the attention block.
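
For reference, that kind of localized workaround generally follows the pattern sketched below: compute the attention scores in fp32, then cast back to the input dtype. This is only an illustration of the idea, with assumed shapes and names, not a copy of the modeling_gpt_neo code.

```python
# Illustrative "compute attention in fp32" pattern; names and shapes are assumptions.
import torch

def attn_probs_fp32(query, key, attention_mask=None):
    orig_dtype = query.dtype                      # e.g. torch.float16 under AMP
    q, k = query.float(), key.float()             # matmul and softmax run in fp32
    scores = torch.matmul(q, k.transpose(-1, -2)) / (q.size(-1) ** 0.5)
    if attention_mask is not None:
        scores = scores + attention_mask          # additive mask with large negative fill
    probs = torch.softmax(scores, dim=-1)
    return probs.to(orig_dtype)                   # cast back before multiplying by v
```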

When max_length is shorter (e.g. 512), this overflow does not occur.

Expected behavior

I expected no overflows.

Aside

I’m reaching out on behalf of EleutherAI; Lysandre told us to create an issue about this.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 5
  • Comments: 62 (39 by maintainers)

Top GitHub Comments

1 reaction
stas00 commented, Apr 21, 2021

In general, if you want users to be able to use fp16 mixed precision for fine-tuning and inference, you need to pre-train the model in this mode. For some models we find workarounds that localize switching to fp32 to the specific submodules that underflow/overflow under fp16, but users often still get NaNs during long training.

Bottom line: if you pre-train in bf16, be prepared to tell users to use fp32 or bf16 in their fine-tuning/inference processes. As new hardware supporting the bf16/tf32 formats emerges (RTX 3090, A100), this will become the simple go-to solution.

Now that DeepSpeed will have a full-fp32 mode, this is great.

So to summarize, at this moment with Samyam’s branch if you use:

  • zero2: you just need to set fp16.enable=false in the DS config (a sketch follows this list)
  • zero3: same as above, plus zero.Init(dtype=torch.float) is needed in modeling_utils.py (instead of just zero.Init()) - I need to think about how to make that configurable.
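
As a rough illustration of the first bullet, assuming the config is written to a JSON file and handed to the HF Trainer via its deepspeed argument. Note that in released DeepSpeed configs the flag is spelled fp16.enabled; the file name below is made up.

```python
# Illustrative ZeRO stage-2 DeepSpeed config with mixed precision turned off (full fp32).
import json

ds_config = {
    "zero_optimization": {"stage": 2},
    "fp16": {"enabled": False},              # fine-tune in fp32 instead of fp16
    "train_micro_batch_size_per_gpu": 1,
}

with open("ds_config_zero2_fp32.json", "w") as f:
    json.dump(ds_config, f, indent=2)

# then e.g. TrainingArguments(..., deepspeed="ds_config_zero2_fp32.json")
```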

1 reaction
stas00 commented, Apr 20, 2021

I’m asking the DeepSpeed devs if they have ideas on how to overcome this; I will keep you posted if we find a good interim solution.

But at the very least we now know why the model fails under fp16.

I wonder whether pre-training processes targeted at mixed-precision use should have a loss-penalty component that forces the model to remain within the fp16 dynamic range, at both the upper and lower ends.
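
Purely as an illustration of that idea (nothing in this thread implements it), such a penalty might hinge on activation magnitudes relative to the fp16 limits; the function name and weighting below are hypothetical.

```python
# Hypothetical fp16-range penalty, as floated above; not part of any transformers or
# DeepSpeed API. Magnitudes outside roughly (6e-5, 65504) are pushed back into range.
import torch

FP16_MAX = 65504.0      # largest finite fp16 value
FP16_TINY = 6.1e-05     # smallest normal fp16 value

def fp16_range_penalty(hidden_states: torch.Tensor, weight: float = 1e-4) -> torch.Tensor:
    mag = hidden_states.float().abs()
    overflow = torch.relu(mag - FP16_MAX)                          # would overflow to inf in fp16
    underflow = torch.relu(FP16_TINY - mag) * (mag > 0).float()    # non-zero values that would flush to zero
    return weight * (overflow.mean() + underflow.mean())

# e.g. loss = lm_loss + fp16_range_penalty(hidden_states)
```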
