FloatingPointError: Loss became infinite or NaN at iteration=556
See original GitHub issue

If you do not know the root cause of the problem / bug, and wish someone to help you, please post according to this template:
Instructions To Reproduce the Issue
- what changes you made (git diff) or what code you wrote:
import os

from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer, DefaultPredictor

# Build the training config from the RetinaNet R-101 FPN 3x baseline.
cfg = get_cfg()
cfg.merge_from_file(
    './detectron2_repo/configs/COCO-Detection/retinanet_R_101_FPN_3x.yaml')
cfg.DATASETS.TRAIN = ('component_train',)
cfg.DATASETS.TEST = ('component_val',)
cfg.DATALOADER.NUM_WORKERS = 0
cfg.MODEL.WEIGHTS = "detectron2://ImageNetPretrained/MSRA/R-101.pkl"
cfg.SOLVER.IMS_PER_BATCH = 2
cfg.BASE_LR = 0.000025
cfg.SOLVER.MAX_ITER = 30000
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 64
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 25
cfg.NUM_GPUS = 2

# Train.
os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()

# Run inference with the trained weights.
cfg.MODEL.WEIGHTS = os.path.join(cfg.OUTPUT_DIR, 'model_final.pth')
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.7
cfg.DATASETS.TEST = ('component_val',)
predictor = DefaultPredictor(cfg)
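For completeness, a minimal sketch of how the predictor created above would typically be used for single-image inference; the image file name here is hypothetical:

import cv2

# Run inference on one image and inspect the predicted instances.
im = cv2.imread("example.jpg")  # hypothetical test image
outputs = predictor(im)
print(outputs["instances"].pred_classes)
print(outputs["instances"].pred_boxes)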
- what exact command you run: python train.py
- what you observed (including the full logs):
Traceback (most recent call last):
File "train.py", line 109, in <module>
trainer.train()
File "/workspace/detectron2/detectron2/engine/train_loop.py", line 132, in train
super().train(self.start_iter, self.max_iter)
File "/workspace/detectron2/detectron2/engine/train_loop.py", line 214, run_step
self._detect_anomaly(losses, loss_dict)
File "/workspace/detectron2/detectron2/engine/train_loop.py", line 237, in _detect_anomaly
self.iter, loss_dict
FloatingPointError: Loss became infinite or NaN at iteration=556!
loss_dict = {'loss_cls': tensor(inf, device='cuda:0', grad_fn=<DivBackward0>), 'loss_box_reg': tensor(5.8437e+25, device='cuda:0', grad_fn=<DivBackward0>)}
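For reference, the check that raises this error is essentially a finiteness test on the summed losses. A minimal sketch of that kind of guard (not the exact detectron2 code) looks like this:

import torch

def detect_anomaly(iteration, loss_dict):
    # Sum the individual loss terms and verify the result is finite;
    # raise if any loss has become inf or NaN.
    losses = sum(loss_dict.values())
    if not torch.isfinite(losses).all():
        raise FloatingPointError(
            f"Loss became infinite or NaN at iteration={iteration}!\n"
            f"loss_dict = {loss_dict}"
        )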
- please also simplify the steps as much as possible so they do not require additional resources to run, such as a private dataset.
Expected behavior
If there is no obvious error in "what you observed" above, please tell us the expected behavior.
If you expect the model to converge / work better, note that we do not give suggestions on how to train your model. We will only help with it in one of two conditions: (1) you are unable to reproduce the results in the detectron2 model zoo, or (2) it indicates a detectron2 bug.
Environment
Please paste the output of python -m detectron2.utils.collect_env, or use python detectron2/utils/collect_env.py if detectron2 hasn't been successfully installed.
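If detectron2 does import correctly, the same report can also be printed from Python; a short sketch using the helper that module exposes:

# Print detectron2's environment report from Python
# (equivalent to `python -m detectron2.utils.collect_env`).
from detectron2.utils.collect_env import collect_env_info

print(collect_env_info())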
Issue Analytics
- Created 4 years ago
- Comments: 5 (1 by maintainers)
Top GitHub Comments
You should reduce your learning rate, for example to 0.00005. With a learning rate that is too high, the loss can fluctuate and eventually blow up.
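A hedged sketch of that suggestion against the config shown above. Note that the snippet in the report sets cfg.BASE_LR rather than cfg.SOLVER.BASE_LR, so the solver may still have been running with the learning rate from the YAML baseline:

# Lower the solver learning rate explicitly; the solver reads
# cfg.SOLVER.BASE_LR, not a top-level cfg.BASE_LR.
cfg.SOLVER.BASE_LR = 0.00005  # value suggested in the comment above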
@jas-nat’s comment is constructive: the error is apparently caused by the learning rate being too high, which causes the optimizer to jump too far (I had the same behavior running a TridentNet model: it worked for a few hundred iterations and then overflowed). There are a few hints on building your own LR scheduler here https://github.com/facebookresearch/detectron2/issues/1224, but the easiest thing to try is just to reduce the learning rate first.
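If lowering the base learning rate alone is not enough, the schedule can also be softened through the standard solver options. This is a sketch of the kind of tweaks meant above, assuming a reasonably recent detectron2 (the gradient-clipping keys may not exist in older versions):

# Gentler warmup and earlier LR decay steps within MAX_ITER=30000.
cfg.SOLVER.WARMUP_ITERS = 2000         # ramp the LR up slowly at the start
cfg.SOLVER.WARMUP_FACTOR = 1.0 / 1000  # begin warmup at BASE_LR / 1000
cfg.SOLVER.STEPS = (20000, 26000)      # iterations at which the LR is decayed
cfg.SOLVER.GAMMA = 0.1                 # multiply the LR by 0.1 at each step

# Newer detectron2 versions also expose gradient clipping, which can keep
# a single bad batch from blowing the loss up to inf.
cfg.SOLVER.CLIP_GRADIENTS.ENABLED = True
cfg.SOLVER.CLIP_GRADIENTS.CLIP_TYPE = "norm"
cfg.SOLVER.CLIP_GRADIENTS.CLIP_VALUE = 1.0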