FP16 overflow with GPT-Neo when using sequence lengths of 2048.
Environment info
- transformers version: 4.5.0.dev0
- Platform: Linux-5.4.0-54-generic-x86_64-with-glibc2.29
- Python version: 3.8.5
- PyTorch version (GPU?): 1.8.0+cu111
- Tensorflow version (GPU?): N/A
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
Who can help
Models:
- GPT-Neo 1.3b
Library:
- deepspeed: @stas00
Information
Model I am using (Bert, XLNet …): GPT-Neo 1.3B
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
- Use GPT-Neo 1.3B with The Pile dataset and the built-in Trainer. Artificial data also suffices; it does not matter what the data is, as long as the attention mask spans all 2048 tokens.
- Enable FP16 and set max_length to 2048.
- Observe that all reported losses are NaN (a minimal sketch follows below).
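A minimal sketch of the kind of script that triggers this, assuming native AMP stands in for the Trainer's fp16 flag and random token IDs stand in for The Pile:

```python
import torch
from transformers import GPTNeoForCausalLM

model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B").cuda()
model.train()

# Random token IDs stand in for The Pile; the content does not matter as long
# as the attention mask covers all 2048 positions.
input_ids = torch.randint(0, model.config.vocab_size, (1, 2048), device="cuda")
attention_mask = torch.ones_like(input_ids)

# fp16 mixed precision via native AMP (the Trainer's fp16=True path is analogous)
with torch.cuda.amp.autocast():
    out = model(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids)

print(out.loss)  # NaN at sequence length 2048; finite at 512
```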
This is also reproducible using AMP or DeepSpeed. There already seems to be code in the GPT-Neo implementation intended to circumvent this, where q, k, v are cast to fp32 in the attention block.
When the max_length is shorter (512) this overflow does not occur.
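For reference, the kind of local upcast referred to above looks roughly like this (an illustrative sketch of the pattern, not a copy of the transformers code):

```python
import torch

def attn_weights_fp32(query, key, causal_mask, mask_value=-1e9):
    # Sketch of the "do attention math in fp32" pattern: upcast before the
    # matmul/softmax so the scores cannot overflow fp16, then cast back.
    q = query.to(torch.float32)
    k = key.to(torch.float32)
    scores = torch.matmul(q, k.transpose(-1, -2))
    scores = torch.where(
        causal_mask, scores,
        torch.tensor(mask_value, dtype=scores.dtype, device=scores.device),
    )
    probs = torch.nn.functional.softmax(scores, dim=-1)
    return probs.to(query.dtype)
```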
Expected behavior
I expected no overflows.
Aside
I’m reaching out on behalf of EleutherAI; Lysandre told us to create an issue about this.
Top GitHub Comments
In general, if you want users to be able to use fp16 mixed precision for fine-tuning and inference, you need to pre-train the model in that mode. For some models we find workarounds that locally switch to fp32 for the specific submodules that underflow/overflow under fp16, but users often still get NaNs during long training.
Bottom line: if you pre-train in bf16, be prepared to tell users to use fp32 or bf16 in their fine-tuning/inference processes. As new hardware supporting the bf16/tf32 formats emerges (rtx-3090 + a100), this will become the simple go-to solution in the future.
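The root of the mismatch is the dynamic range of the two formats; a quick check in plain PyTorch (nothing model-specific):

```python
import torch

# bf16 keeps fp32's exponent range but with fewer mantissa bits, while fp16
# tops out around 6.5e4, so bf16-pretrained activations can overflow fp16.
print(torch.finfo(torch.float16).max)    # 65504.0
print(torch.finfo(torch.bfloat16).max)   # ~3.39e38
print(torch.finfo(torch.float32).max)    # ~3.40e38
```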
Now that DeepSpeed will have a full-fp32 mode, this is great.
So to summarize, at this moment with Samyam’s branch: if you set `fp16.enabled=false` in the ds config, then `zero.Init(dtype=torch.float)` is needed in `modeling_utils.py` (instead of just `zero.Init()`) - I need to think how to make that configurable.
I’m asking the DeepSpeed devs if they have some ideas on how to overcome this; I will keep you posted if we find a good intermediary solution.
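A rough sketch of the combination being described (not the actual modeling_utils.py patch; the config dict, ZeRO stage, and model name are illustrative):

```python
import torch
import deepspeed
from transformers import GPTNeoForCausalLM

# fp16 disabled in the DeepSpeed config -> full-fp32 mode.
ds_config = {
    "fp16": {"enabled": False},
    "zero_optimization": {"stage": 3},   # assumed: zero.Init is the ZeRO-3 init path
}
# In the HF integration this dict would live in the JSON file passed via --deepspeed.

# The ZeRO init context then has to be told explicitly to keep params in fp32.
with deepspeed.zero.Init(dtype=torch.float):   # instead of plain zero.Init()
    model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
```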
But at the very least we now know why the model fails under fp16.
I wonder if pre-training processes targeted for mixed precision use should have a loss penalty component that forces the model to remain within fp16 dynamic range, both upper and lower.
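A toy rendering of that idea (the thresholds, weighting, and choice of which activations to penalize are all assumptions, not an established recipe):

```python
import torch

def fp16_range_penalty(hidden_states, upper=6.0e4, lower=6.0e-5, weight=1e-4):
    # Illustrative only: penalize magnitudes approaching fp16's max (~65504)
    # or, for nonzero values, falling below its smallest normal (~6.1e-5).
    absval = hidden_states.abs()
    over = torch.relu(absval - upper).mean()
    nonzero = absval[absval > 0]
    under = torch.relu(lower - nonzero).mean() if nonzero.numel() else absval.new_zeros(())
    return weight * (over + under)

# During pre-training, something like:
#   loss = lm_loss + fp16_range_penalty(hidden_states)
```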