
FloatingPointError: Loss became infinite or NaN at iteration=556

See original GitHub issue

If you do not know the root cause of the problem/bug and would like someone to help you, please post according to this template:

Instructions To Reproduce the Issue

  1. what changes you made (git diff) or what code you wrote
import os

from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor, DefaultTrainer

cfg = get_cfg()
cfg.merge_from_file(
    './detectron2_repo/configs/COCO-Detection/retinanet_R_101_FPN_3x.yaml')
# 'component_train' / 'component_val' are custom datasets that must be
# registered before training; see the registration sketch after this list.
cfg.DATASETS.TRAIN = ('component_train',)
cfg.DATASETS.TEST = ('component_val',)  # trailing comma makes this a tuple, not a string
cfg.DATALOADER.NUM_WORKERS = 0
cfg.MODEL.WEIGHTS = "detectron2://ImageNetPretrained/MSRA/R-101.pkl"
cfg.SOLVER.IMS_PER_BATCH = 2
cfg.SOLVER.BASE_LR = 0.000025  # the learning-rate key lives under SOLVER
cfg.SOLVER.MAX_ITER = 30000
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 64
# Note: RetinaNet reads its class count from MODEL.RETINANET.NUM_CLASSES,
# not MODEL.ROI_HEADS.NUM_CLASSES.
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 25
cfg.NUM_GPUS = 2  # not a detectron2 config key; multi-GPU training uses detectron2.engine.launch

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()

cfg.MODEL.WEIGHTS = os.path.join(cfg.OUTPUT_DIR, 'model_final.pth')
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.7
cfg.DATASETS.TEST = ('component_val',)
predictor = DefaultPredictor(cfg)  # DefaultPredictor takes the config as an argument
  2. what exact command you run: python train.py

  3. what you observed (including the full logs):

Traceback (most recent call last):
  File "train.py", line 109, in <module>
    trainer.train()
  File "/workspace/detectron2/detectron2/engine/train_loop.py", line 132, in train
    super().train(self.start_iter, self.max_iter)
  File "/workspace/detectron2/detectron2/engine/train_loop.py", line 214, run_step
    self._detect_anomaly(losses, loss_dict)
  File "/workspace/detectron2/detectron2/engine/train_loop.py", line 237, in _detect_anomaly
    self.iter, loss_dict

FloatingPointError: Loss became infinite or NaN at iteration=556!
loss_dict = {'loss_cls': tensor(inf, device='cuda:0', grad_fn=<DivBackward0>), 'loss_box_reg': tensor(5.8437e+25, device='cuda:0', grad_fn=<DivBackward0>)}

  4. please also simplify the steps as much as possible so they do not require additional resources to run, such as a private dataset.
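For completeness: the reproduction script in step 1 assumes the 'component_train' and 'component_val' datasets have already been registered with detectron2. A minimal sketch of that registration using detectron2's register_coco_instances is shown below; the annotation and image paths are hypothetical placeholders, not taken from the original issue.

# Sketch: register the custom COCO-format datasets referenced by the
# training script above. All paths below are illustrative placeholders.
from detectron2.data.datasets import register_coco_instances

register_coco_instances(
    "component_train",                             # name used in cfg.DATASETS.TRAIN
    {},                                            # extra metadata (optional)
    "datasets/component/train_annotations.json",   # COCO-format annotations (placeholder)
    "datasets/component/train_images",             # image root (placeholder)
)
register_coco_instances(
    "component_val",
    {},
    "datasets/component/val_annotations.json",
    "datasets/component/val_images",
)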

Expected behavior

If there is no obvious error in the “what you observed” section above, please tell us the expected behavior.

If you expect the model to converge / work better, note that we do not give suggestions on how to train your model. We will only help with it in one of two conditions: (1) you’re unable to reproduce the results in the detectron2 model zoo, or (2) it indicates a detectron2 bug.

Environment

Please paste the output of python -m detectron2.utils.collect_env, or use python detectron2/utils/collect_env.py if detectron2 hasn’t been successfully installed.

(The environment details were attached as a screenshot: KakaoTalk_Photo_2019-12-19-14-39-40.)

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

3 reactions
jas-nat commented, Dec 23, 2019

You should reduce your learning rate to 0.00005, for example. The loss may fluctuate due to a too-high learning rate.
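In config terms, that suggestion maps onto the reproduction script above roughly as follows; 0.00005 is simply the value proposed in this comment.

# Assuming `cfg` is the config built in the reproduction script above,
# lower the base learning rate before constructing DefaultTrainer.
cfg.SOLVER.BASE_LR = 0.00005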

1 reaction
jcpayne commented, Jul 7, 2020

@jas-nat’s comment is constructive: the error is apparently caused by the learning rate being too high, which causes the optimizer to jump too far (I had the same behavior running a TridentNet model: it worked for a few hundred iterations and then overflowed). There are a few hints on building your own LR scheduler here https://github.com/facebookresearch/detectron2/issues/1224, but the easiest thing to try is just to reduce the learning rate first.
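For reference, a hedged sketch of related solver settings follows: on top of lowering SOLVER.BASE_LR as above, a longer warmup and gradient clipping are common ways to tame exploding losses. Gradient clipping is not explicitly recommended in this thread (its CLIP_GRADIENTS keys exist in recent detectron2 releases), and the values below are purely illustrative.

# Sketch: additional solver settings that often help with exploding / NaN losses.
# Assumes `cfg` is the config from the reproduction script; values are illustrative.
cfg.SOLVER.WARMUP_ITERS = 1000            # ramp the learning rate up gradually
cfg.SOLVER.WARMUP_FACTOR = 1.0 / 1000     # start at BASE_LR / 1000

# Gradient clipping (available in recent detectron2 versions).
cfg.SOLVER.CLIP_GRADIENTS.ENABLED = True
cfg.SOLVER.CLIP_GRADIENTS.CLIP_TYPE = "value"
cfg.SOLVER.CLIP_GRADIENTS.CLIP_VALUE = 1.0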

Read more comments on GitHub >

Top Results From Across the Web

File "/home/jake/detectron2/detectron2/engine/train_loop.py ...
FloatingPointError: Loss became infinite or NaN at iteration=1! loss_dict = {'loss_cls_stage0': 1.613979458808899, 'loss_box_reg_stage0': ...
Read more >
Error: FloatingPointError: Loss became infinite or NaN at ...
Error: FloatingPointError: Loss became infinite or NaN at iteration=1099!
Read more >
FloatingPointError: Loss became infinite or NaN at iteration=1002 ...
FloatingPointError: Loss became infinite or NaN at iteration=1002! ... But when the train is in 1000 iterations, loss became infinite or NaN....
Read more >
loss became infinite or Nan at iteration = 1099! - 文章整合
Report errors FloatingPointError: Loss became infinite or NaN at iteration=1099! [04/01 15:05:09] d2.engine.train_loop ERROR: Exception ...
Read more >
FloatingPointError: Loss became infinite or NaN at iteration=88!
FloatingPointError: Loss became infinite or NaN at iteration=88! Excuse me, what kind of error is this. After the code runs for a...
Read more >
