
LayoutLMv2 NaN training loss and eval

See original GitHub issue

Describe the bug

The model I am using is LayoutLMv2 with a custom dataset.

The problem arises when using:

  • the official example scripts: I am using the stock run_funsd.py, but with a modified dataset.

To Reproduce

Steps to reproduce the behavior:

run_funsd.py --do_eval=True --do_predict=True --do_train=True --early_stop_patience=4 --evaluation_strategy=epoch --fp16=True --load_best_model_at_end=True --max_train_samples=1000 --model_name_or_path=microsoft/layoutlmv2-base-uncased --num_train_epochs=30 --output_dir=/tmp/test-ner --overwrite_output_dir=True --report_to=wandb --save_strategy=epoch --save_total_limit=1 --warmup_ratio=0.1

Fortunately, I recorded everything with wandb.

[wandb screenshot: training loss, eval loss, F1, and samples-per-second curves over epochs]

After 8 epochs, the training and eval loss went to NaN, while the F1 score dropped suddenly. The samples per second increased significantly as well.
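A quick way to surface this failure mode sooner is to stop the run as soon as the logged loss turns NaN, rather than letting it continue toward the full 30 epochs on NaN values. Below is a minimal, hypothetical sketch of such a guard as a transformers TrainerCallback; NanLossCallback is an illustrative name, not something run_funsd.py ships with.

import math

from transformers import TrainerCallback

class NanLossCallback(TrainerCallback):
    # on_log fires whenever the Trainer logs metrics (including the running loss).
    def on_log(self, args, state, control, logs=None, **kwargs):
        loss = (logs or {}).get("loss")
        if loss is not None and math.isnan(loss):
            print(f"NaN loss at step {state.global_step}; stopping the run.")
            control.should_training_stop = True

# Usage (hypothetical): Trainer(..., callbacks=[NanLossCallback()])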

  • Platform:
  • Python version: 3.7.1
  • PyTorch version (GPU?): Tesla T4

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 11

Top GitHub Comments

1 reaction
magataro commented, May 20, 2022

Perhaps there is no problem with the loss calculation code itself. In my case, I got NaN values only when computing the loss under autocast(); once I stopped using AMP, the NaN values went away. I hope this is helpful to you.

NaN with AMP is a known issue. https://github.com/pytorch/pytorch/issues/40497
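For reference, here is a minimal sketch of the pattern magataro describes: run the forward pass under autocast(), but compute the loss in fp32. The nn.Linear model below is only a stand-in for LayoutLMv2; the autocast/float() pattern is the point, not the model.

import torch
from torch import nn
from torch.cuda.amp import autocast, GradScaler

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(10, 5).to(device)            # placeholder for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scaler = GradScaler(enabled=(device == "cuda"))

inputs = torch.randn(8, 10, device=device)
labels = torch.randint(0, 5, (8,), device=device)

with autocast(enabled=(device == "cuda")):
    logits = model(inputs)                     # forward pass may run in fp16

# Cast logits back to fp32 before the loss: log-softmax can overflow in
# fp16, which is one common source of NaN when training with AMP.
loss = nn.functional.cross_entropy(logits.float(), labels)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

The blunter version of the same workaround, given the Trainer flags in the reproduction command above, is simply dropping --fp16=True.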

0 reactions
XueAdas commented, May 20, 2022

Your email has been received, thank you!
Xue Xu    Tel: @.***

Read more comments on GitHub >

Top Results From Across the Web

`nan` training loss but eval loss does improve over time
I've been playing around with the XLSR-53 fine-tuning functionality but I keep getting nan training loss. Audio files I'm using are: Down-sampled to...
Read more >
LayoutLMv2: Multi-modal Pre-training for Visually-rich Document ...
paper, we present LayoutLMv2 by pre-training text, layout and image in a multi-modal framework, where new model architectures and pre-training tasks.
Read more >
LayoutLMv2: Multi-modal Pre-training for Visually-rich ...
loss in the optimization process. 3 Experiments. 3.1 Data. In order to pre-train and evaluate LayoutLMv2 models, we select datasets in a ...
Read more >
arXiv:2012.14740v4 [cs.CL] 10 Jan 2022
LayoutLMv2 : Multi-modal Pre-training for Visually-rich. Document Understanding ... datasets as the downstream tasks to evaluate the per-.
Read more >
(PDF) LayoutLMv2: Multi-modal Pre-training for Visually-Rich ...
PDF | Pre-training of text and layout has proved effective in a variety of ... In order to pre-train and evaluate LayoutLMv2 models, ......
Read more >
