
NaN loss while using a resnet_fpn (not pretrained) backbone with FasterRCNN


🐛 Bug

I am using this model:

from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone
backbone = resnet_fpn_backbone(backbone_name='resnet152', pretrained=False)
model = FasterRCNN(backbone, num_classes=2)

When I set pretrained=True in the backbone it works absolutely fine, but when I set pretrained=False it starts giving output like this:

Epoch: [0]  [  0/457]  eta: 0:27:14  lr: 0.000003  loss: 141610976.0000 (141610976.0000)  loss_classifier: 50311224.0000 (50311224.0000)  loss_box_reg: 62420652.0000 (62420652.0000)  loss_objectness: 7461720.0000 (7461720.0000)  loss_rpn_box_reg: 21417388.0000 (21417388.0000)  time: 3.5773  data: 0.8030  max mem: 10427
Loss is nan, stopping training
{'loss_classifier': tensor(nan, device='cuda:1', grad_fn=<NllLossBackward>), 'loss_box_reg': tensor(nan, device='cuda:1', grad_fn=<DivBackward0>), 'loss_objectness': tensor(nan, device='cuda:1', grad_fn=<BinaryCrossEntropyWithLogitsBackward>), 'loss_rpn_box_reg': tensor(nan, device='cuda:1', grad_fn=<DivBackward0>)}

Environment

torch-version = 1.9.0a0+c3d40fd
torchvision-version = 0.10.0a0

Using this Dockerfile:

FROM nvcr.io/nvidia/pytorch:21.06-py3

RUN pip install pytorch-lightning
RUN pip install -U git+https://github.com/albu/albumentations --no-cache-dir
RUN pip install --upgrade albumentations
RUN pip install timm
RUN pip install odach
RUN pip install ensemble_boxes
RUN pip install opencv-python-headless
RUN pip install --no-cache-dir --upgrade pip

RUN apt update && apt install -y libsm6 libxext6
RUN apt-get install -y libxrender-dev

RUN apt install -y p7zip-full p7zip-rar

Help!

cc @datumbox

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

3 reactions
datumbox commented on Aug 17, 2021

@sahilg06 Thanks for providing such a good example and apologies for the late response. I was able to reliably reproduce the problem.

The NaNs come from two separate issues:

1. Empty RPN proposals

Because the network is so deep and you initialize it with random weights, the initial RPN proposals are of very poor quality. As a result, the post-processing/filtering steps remove them all. This happens at: https://github.com/pytorch/vision/blob/c8f7d772e844d707e152e2a1fa1aad26cf1b7530/torchvision/models/detection/rpn.py#L355-L356

At the line above, the proposals are non-empty but the boxes are empty. This is because of the small-box removal that happens at: https://github.com/pytorch/vision/blob/c8f7d772e844d707e152e2a1fa1aad26cf1b7530/torchvision/models/detection/rpn.py#L262

The empty proposals eventually lead to calling fastrcnn_loss() with empty predictions and targets, which causes the NaNs.
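As a rough illustration (a minimal sketch of the failure mode, not the library code itself): reducing a loss over an empty set of proposals means summing an empty tensor and dividing by a zero element count, which is where the NaN shows up.

import torch

# With zero surviving proposals, the loss reduces an empty tensor and
# divides by a zero element count: 0 / 0 gives NaN.
empty = torch.zeros(0)
print(empty.sum() / empty.numel())  # tensor(nan)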

To ensure your model has sufficient proposals, you can try setting model.rpn.min_size = 0.0 after you initialize the model. Note that you might need to tweak some of the other filtering parameters as well.
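For example (a minimal sketch based on the snippet in this issue; model.rpn.min_size is the attribute suggested above, while the other knobs mentioned in the comment are only suggestions, not a confirmed recipe):

from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

backbone = resnet_fpn_backbone(backbone_name='resnet152', pretrained=False)
model = FasterRCNN(backbone, num_classes=2)

# Keep even tiny boxes so the RPN output is not filtered down to nothing.
model.rpn.min_size = 0.0
# Assumption: other RPN filtering parameters (e.g. model.rpn.nms_thresh or
# the pre/post-NMS top-n counts) may also need tuning for your data.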

2. Too high LR, too low weight decay

With the above fix, the first loss call will no longer be NaN, but subsequent steps can still produce NaN values. This is because of the massive loss value, which leads to overflows in the weight updates.
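As a rough illustration (a toy sketch, not the actual training dynamics): with a loss around 1.4e8 the gradients are of a similar order, so an SGD step with lr=0.01 moves weights by roughly 1e6; a few such steps push values past float32 range, and inf quickly turns into NaN downstream.

import torch

# float32 tops out around 3.4e38; anything past that becomes inf,
# and inf - inf (or 0 * inf) becomes NaN in the next computation.
print(torch.tensor(3.4e38) * 10)                  # tensor(inf)
print(torch.tensor(float('inf')) - float('inf'))  # tensor(nan)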

You can fix the issue by reconfiguring your optimizer:

optimizer = torch.optim.SGD(params, lr=1e-8, momentum=0.9, weight_decay=0.05)

Fixing the above should allow you to consistently execute your snippet without hitting NaNs:

tensor(5870150.5000, grad_fn=<AddBackward0>)
tensor(7591120.5000, grad_fn=<AddBackward0>)
tensor(0., grad_fn=<AddBackward0>)
tensor(0., grad_fn=<AddBackward0>)

Finally, please note that as far as I can see, the problem is reproducible only with very deep networks such as resnet152. Using shallower networks like resnet50 usually avoids problem 1 (though not problem 2; you still need to adapt your optimizer config). This makes intuitive sense, because very deep networks initialized randomly are more likely to yield low-quality initial proposals. For most applications it’s worth pre-training the backbone on a separate dataset or starting with some pre-trained weights.
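For instance (a minimal sketch mirroring the snippet in this issue; pretrained=True loads the ImageNet weights for the backbone):

from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

# Shallower backbone initialized from ImageNet weights: in practice this
# avoids the empty-proposal problem and yields sane initial losses.
backbone = resnet_fpn_backbone(backbone_name='resnet50', pretrained=True)
model = FasterRCNN(backbone, num_classes=2)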

I believe the above answers the question so I’m going to close the issue to keep things tidy. If you have more questions, feel free to reopen it.

0 reactions
sahilg06 commented on Jul 16, 2021

Hi @NicolasHug, sorry for the very late reply. Is this okay as a minimal example with randomly generated input?

import torch
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

backbone = resnet_fpn_backbone(backbone_name='resnet152', pretrained=False)
model = FasterRCNN(backbone, num_classes=2)
model.to(device)
image = torch.rand(1, 3, 1024, 1024)
boxes = torch.zeros((0, 4), dtype=torch.float32)
labels = torch.zeros(0, dtype=torch.int64)
areas = torch.zeros(0, dtype=torch.float32)
iscrowd = torch.zeros((0,), dtype=torch.int64)

target = {
    'boxes': boxes.to(device),
    'labels': labels.to(device),
    'image_id': torch.tensor([0], dtype=torch.int64).to(device), 
    'area': areas.to(device),
    'iscrowd': iscrowd.to(device)
}

params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.01, momentum=0.9, weight_decay=0.0005)
for i in range(4):
  loss_dict = model(image.to(device), [target])
  losses = sum(loss for loss in loss_dict.values())
  print(losses)
  optimizer.zero_grad()
  losses.backward()
  optimizer.step()

When I set pretrained=False in the backbone, I got the following output:

tensor(nan, device='cuda:0', grad_fn=<AddBackward0>)
tensor(nan, device='cuda:0', grad_fn=<AddBackward0>)
tensor(nan, device='cuda:0', grad_fn=<AddBackward0>)
tensor(nan, device='cuda:0', grad_fn=<AddBackward0>)

And when I set pretrained=True in the backbone, I got the following output:

tensor(1.3928, device='cuda:0', grad_fn=<AddBackward0>)
tensor(0.8798, device='cuda:0', grad_fn=<AddBackward0>)
tensor(0.7465, device='cuda:0', grad_fn=<AddBackward0>)
tensor(0.2988, device='cuda:0', grad_fn=<AddBackward0>)