FloatingPointError: Loss became infinite or NaN at iteration=556
See original GitHub issue

If you do not know the root cause of the problem / bug, and wish someone to help you, please post according to this template:
Instructions To Reproduce the Issue
- what changes you made (git diff) or what code you wrote:
import os

from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer, DefaultPredictor

# Build the training config from the RetinaNet R-101 FPN 3x baseline.
cfg = get_cfg()
cfg.merge_from_file(
    './detectron2_repo/configs/COCO-Detection/retinanet_R_101_FPN_3x.yaml')
cfg.DATASETS.TRAIN = ('component_train',)
cfg.DATASETS.TEST = ('component_val',)
cfg.DATALOADER.NUM_WORKERS = 0
cfg.MODEL.WEIGHTS = "detectron2://ImageNetPretrained/MSRA/R-101.pkl"
cfg.SOLVER.IMS_PER_BATCH = 2
cfg.BASE_LR = 0.000025
cfg.SOLVER.MAX_ITER = 30000
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 64
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 25
cfg.NUM_GPUS = 2

# Train.
os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()

# Run inference with the trained weights.
cfg.MODEL.WEIGHTS = os.path.join(cfg.OUTPUT_DIR, 'model_final.pth')
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.7
cfg.DATASETS.TEST = ('component_val',)
predictor = DefaultPredictor(cfg)
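For completeness, a minimal sketch of how the predictor created above would typically be used for single-image inference; the image file name here is hypothetical:

import cv2

# Run inference on one image and inspect the predicted instances.
im = cv2.imread("example.jpg")  # hypothetical test image
outputs = predictor(im)
print(outputs["instances"].pred_classes)
print(outputs["instances"].pred_boxes)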
- what exact command you run: python train.py
- what you observed (including the full logs):
Traceback (most recent call last):
File "train.py", line 109, in <module>
trainer.train()
File "/workspace/detectron2/detectron2/engine/train_loop.py", line 132, in train
super().train(self.start_iter, self.max_iter)
File "/workspace/detectron2/detectron2/engine/train_loop.py", line 214, run_step
self._detect_anomaly(losses, loss_dict)
File "/workspace/detectron2/detectron2/engine/train_loop.py", line 237, in _detect_anomaly
self.iter, loss_dict
FloatingPointError: Loss became infinite or NaN at iteration=556!
loss_dict = {'loss_cls': tensor(inf, device='cuda:0', grad_fn=<DivBackward0>), 'loss_box_reg': tensor(5.8437e+25, device='cuda:0', grad_fn=<DivBackward0>)}
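For reference, the check that raises this error is essentially a finiteness test on the summed losses. A minimal sketch of that kind of guard (not the exact detectron2 code) looks like this:

import torch

def detect_anomaly(iteration, loss_dict):
    # Sum the individual loss terms and verify the result is finite;
    # raise if any loss has become inf or NaN.
    losses = sum(loss_dict.values())
    if not torch.isfinite(losses).all():
        raise FloatingPointError(
            f"Loss became infinite or NaN at iteration={iteration}!\n"
            f"loss_dict = {loss_dict}"
        )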
- please also simplify the steps as much as possible so they do not require additional resources to run, such as a private dataset.
Expected behavior
If there is no obvious error in "what you observed" above, please tell us the expected behavior.
If you expect the model to converge / work better, note that we do not give suggestions on how to train your model. We will only help with it in one of two conditions: (1) you are unable to reproduce the results in the detectron2 model zoo, or (2) it indicates a detectron2 bug.
Environment
Please paste the output of python -m detectron2.utils.collect_env, or use python detectron2/utils/collect_env.py if detectron2 hasn't been successfully installed.
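If detectron2 does import correctly, the same report can also be printed from Python; a short sketch using the helper that module exposes:

# Print detectron2's environment report from Python
# (equivalent to `python -m detectron2.utils.collect_env`).
from detectron2.utils.collect_env import collect_env_info

print(collect_env_info())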
Issue Analytics
- Created 4 years ago
- Comments: 5 (1 by maintainers)
Top GitHub Comments
You should reduce your learning rate, for example to 0.00005. With a learning rate that is too high, the loss can fluctuate and eventually blow up.
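A hedged sketch of that suggestion against the config shown above. Note that the snippet in the report sets cfg.BASE_LR rather than cfg.SOLVER.BASE_LR, so the solver may still have been running with the learning rate from the YAML baseline:

# Lower the solver learning rate explicitly; the solver reads
# cfg.SOLVER.BASE_LR, not a top-level cfg.BASE_LR.
cfg.SOLVER.BASE_LR = 0.00005  # value suggested in the comment above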
@jas-nat’s comment is constructive: the error is apparently caused by the learning rate being too high, which causes the optimizer to jump too far (I had the same behavior running a TridentNet model: it worked for a few hundred iterations and then overflowed). There are a few hints on building your own LR scheduler here https://github.com/facebookresearch/detectron2/issues/1224, but the easiest thing to try is just to reduce the learning rate first.
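If lowering the base learning rate alone is not enough, the schedule can also be softened through the standard solver options. This is a sketch of the kind of tweaks meant above, assuming a reasonably recent detectron2 (the gradient-clipping keys may not exist in older versions):

# Gentler warmup and earlier LR decay steps within MAX_ITER=30000.
cfg.SOLVER.WARMUP_ITERS = 2000         # ramp the LR up slowly at the start
cfg.SOLVER.WARMUP_FACTOR = 1.0 / 1000  # begin warmup at BASE_LR / 1000
cfg.SOLVER.STEPS = (20000, 26000)      # iterations at which the LR is decayed
cfg.SOLVER.GAMMA = 0.1                 # multiply the LR by 0.1 at each step

# Newer detectron2 versions also expose gradient clipping, which can keep
# a single bad batch from blowing the loss up to inf.
cfg.SOLVER.CLIP_GRADIENTS.ENABLED = True
cfg.SOLVER.CLIP_GRADIENTS.CLIP_TYPE = "norm"
cfg.SOLVER.CLIP_GRADIENTS.CLIP_VALUE = 1.0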