
NaN loss while using a resnet_fpn (not pretrained) backbone with FasterRCNN


🐛 Bug

I am using this model:

from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone
backbone = resnet_fpn_backbone(backbone_name='resnet152', pretrained=False)
model = FasterRCNN(backbone, num_classes=2)

When I set pretrained=True in the backbone it works absolutely fine, but when I set pretrained=False it starts giving output like this:

Epoch: [0]  [  0/457]  eta: 0:27:14  lr: 0.000003  loss: 141610976.0000 (141610976.0000)  loss_classifier: 50311224.0000 (50311224.0000)  loss_box_reg: 62420652.0000 (62420652.0000)  loss_objectness: 7461720.0000 (7461720.0000)  loss_rpn_box_reg: 21417388.0000 (21417388.0000)  time: 3.5773  data: 0.8030  max mem: 10427
Loss is nan, stopping training
{'loss_classifier': tensor(nan, device='cuda:1', grad_fn=<NllLossBackward>), 'loss_box_reg': tensor(nan, device='cuda:1', grad_fn=<DivBackward0>), 'loss_objectness': tensor(nan, device='cuda:1', grad_fn=<BinaryCrossEntropyWithLogitsBackward>), 'loss_rpn_box_reg': tensor(nan, device='cuda:1', grad_fn=<DivBackward0>)}

Environment

torch-version = 1.9.0a0+c3d40fd
torchvision-version = 0.10.0a0

Using this Dockerfile:

FROM nvcr.io/nvidia/pytorch:21.06-py3

RUN pip install pytorch-lightning
RUN pip install -U git+https://github.com/albu/albumentations --no-cache-dir
RUN pip install --upgrade albumentations
RUN pip install timm
RUN pip install odach
RUN pip install ensemble_boxes
RUN pip install opencv-python-headless
RUN pip install --no-cache-dir --upgrade pip

RUN apt update && apt install -y libsm6 libxext6
RUN apt-get install -y libxrender-dev

RUN apt install -y p7zip-full p7zip-rar

Help!

cc @datumbox

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

3 reactions
datumbox commented on Aug 17, 2021

@sahilg06 Thanks for providing such a good example and apologies for the late response. I was able to reliably reproduce the problem.

The NaNs come from two separate issues:

1. Empty RPN proposals

Because the network is so deep and you initialize it with random weights, the initial RPN proposals are of very poor quality. As a result, the post-processing/filtering steps remove them all. This happens at: https://github.com/pytorch/vision/blob/c8f7d772e844d707e152e2a1fa1aad26cf1b7530/torchvision/models/detection/rpn.py#L355-L356

At the line above, the proposals are non-empty but the boxes are empty. This is because of the small-box removal that happens at: https://github.com/pytorch/vision/blob/c8f7d772e844d707e152e2a1fa1aad26cf1b7530/torchvision/models/detection/rpn.py#L262

The empty proposals eventually lead to calling fastrcnn_loss() with empty predictions and targets, which causes the NaNs.
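As a rough illustration (a minimal sketch of the failure mode, not the library code itself): reducing a loss over an empty set of proposals means summing an empty tensor and dividing by a zero element count, which is where the NaN shows up.

import torch

# With zero surviving proposals, the loss reduces an empty tensor and
# divides by a zero element count: 0 / 0 gives NaN.
empty = torch.zeros(0)
print(empty.sum() / empty.numel())  # tensor(nan)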

To ensure your model has sufficient proposals, you can try setting model.rpn.min_size = 0.0 after you initialize the model. Note that you might need to tweak some of the other filtering parameters as well.
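For example (a minimal sketch based on the snippet in this issue; model.rpn.min_size is the attribute suggested above, while the other knobs mentioned in the comment are only suggestions, not a confirmed recipe):

from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

backbone = resnet_fpn_backbone(backbone_name='resnet152', pretrained=False)
model = FasterRCNN(backbone, num_classes=2)

# Keep even tiny boxes so the RPN output is not filtered down to nothing.
model.rpn.min_size = 0.0
# Assumption: other RPN filtering parameters (e.g. model.rpn.nms_thresh or
# the pre/post-NMS top-n counts) may also need tuning for your data.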

2. Too high LR, too low weight decay

With the above fix, the first loss call will no longer be NaN, but subsequent steps can still produce NaN values. This is because of the massive loss value, which leads to overflows in the weight updates.
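As a rough illustration (a toy sketch, not the actual training dynamics): with a loss around 1.4e8 the gradients are of a similar order, so an SGD step with lr=0.01 moves weights by roughly 1e6; a few such steps push values past float32 range, and inf quickly turns into NaN downstream.

import torch

# float32 tops out around 3.4e38; anything past that becomes inf,
# and inf - inf (or 0 * inf) becomes NaN in the next computation.
print(torch.tensor(3.4e38) * 10)                  # tensor(inf)
print(torch.tensor(float('inf')) - float('inf'))  # tensor(nan)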

You can fix the issue by reconfiguring your optimizer:

optimizer = torch.optim.SGD(params, lr=1e-8, momentum=0.9, weight_decay=0.05)

Fixing the above should allow you to consistently execute your snippet without hitting NaNs:

tensor(5870150.5000, grad_fn=<AddBackward0>)
tensor(7591120.5000, grad_fn=<AddBackward0>)
tensor(0., grad_fn=<AddBackward0>)
tensor(0., grad_fn=<AddBackward0>)

Finally, please note that as far as I can see, the problem is reproducible only with very deep networks such as resnet152. Using shallower networks like resnet50 usually avoids problem 1 (though not problem 2; you still need to adapt your optimizer config). This makes intuitive sense, because very deep networks initialized randomly are more likely to yield low-quality initial proposals. For most applications it’s worth pre-training the backbone on a separate dataset or starting with some pre-trained weights.
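For instance (a minimal sketch mirroring the snippet in this issue; pretrained=True loads the ImageNet weights for the backbone):

from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

# Shallower backbone initialized from ImageNet weights: in practice this
# avoids the empty-proposal problem and yields sane initial losses.
backbone = resnet_fpn_backbone(backbone_name='resnet50', pretrained=True)
model = FasterRCNN(backbone, num_classes=2)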

I believe the above answers the question so I’m going to close the issue to keep things tidy. If you have more questions, feel free to reopen it.

0 reactions
sahilg06 commented on Jul 16, 2021

Hi @NicolasHug, sorry for the very late reply. Is this okay as a minimal example with randomly generated input?

import torch
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

backbone = resnet_fpn_backbone(backbone_name='resnet152', pretrained=False)
model = FasterRCNN(backbone, num_classes=2)
model.to(device)
image = torch.rand(1, 3, 1024, 1024)
boxes = torch.zeros((0, 4), dtype=torch.float32)
labels = torch.zeros(0, dtype=torch.int64)
areas = torch.zeros(0, dtype=torch.float32)
iscrowd = torch.zeros((0,), dtype=torch.int64)

target = {
    'boxes': boxes.to(device),
    'labels': labels.to(device),
    'image_id': torch.tensor([0], dtype=torch.int64).to(device), 
    'area': areas.to(device),
    'iscrowd': iscrowd.to(device)
}

params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.01, momentum=0.9, weight_decay=0.0005)
for i in range(4):
  loss_dict = model(image.to(device), [target])
  losses = sum(loss for loss in loss_dict.values())
  print(losses)
  optimizer.zero_grad()
  losses.backward()
  optimizer.step()

When I set pretrained=False in the backbone, I got the following output:

tensor(nan, device='cuda:0', grad_fn=<AddBackward0>)
tensor(nan, device='cuda:0', grad_fn=<AddBackward0>)
tensor(nan, device='cuda:0', grad_fn=<AddBackward0>)
tensor(nan, device='cuda:0', grad_fn=<AddBackward0>)

And when I set pretrained=True in the backbone, I got the following output:

tensor(1.3928, device='cuda:0', grad_fn=<AddBackward0>)
tensor(0.8798, device='cuda:0', grad_fn=<AddBackward0>)
tensor(0.7465, device='cuda:0', grad_fn=<AddBackward0>)
tensor(0.2988, device='cuda:0', grad_fn=<AddBackward0>)