Error when training DeepLabV3+

Hello, I get the following error when attempting to train a DeepLabV3+ model for simple foreground/background segmentation. The model is fed 1024x1024 images, and the annotations are as described in the documentation, i.e. single-channel images with 0s for one category and 1s for the other. The machine has two RTX 2080 Ti GPUs.

File "/home/ndserv05/Documents/Python/detectron2/detectron2/layers/aspp.py", line 135, in forward
    "Input size: {} `pool_kernel_size`: {}".format(size, self.pool_kernel_size)
ValueError: `pool_kernel_size` must be divisible by the shape of inputs. Input size: torch.Size([32, 32]) `pool_kernel_size`: (32, 64)
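
For readers trying to parse the message, here is a minimal sketch of the arithmetic it appears to describe. It is not the library's code: the helper name `pool_kernel_fits` is hypothetical, and the direction of the check (each input dimension must be a whole multiple of the matching kernel dimension) is inferred from the reported values, since that is the only reading under which (32, 32) against (32, 64) fails.

```python
def pool_kernel_fits(input_size, pool_kernel_size):
    """Return True when every input dimension is a whole multiple of the
    matching kernel dimension (hypothetical helper, inferred from the error)."""
    return all(dim % k == 0 for dim, k in zip(input_size, pool_kernel_size))


# Values copied from the error message above.
print(pool_kernel_fits((32, 32), (32, 64)))  # False: width 32 is not a multiple of 64

# A 32x64 feature map (hypothetical here) would pass the same check.
print(pool_kernel_fits((32, 64), (32, 64)))  # True
```

Under this reading, the feature map reaching the pooling layer is half as wide as the configured kernel expects, which points at a mismatch between the image or crop shape actually fed to the model and the shape the config assumes.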

Instructions To Reproduce the Issue:

import detectron2
from detectron2.utils.logger import setup_logger
setup_logger()

# import some common libraries
import numpy as np
import os, json, cv2, random
import pycocotools
import skimage.draw
from PIL import Image, ImageDraw
from progress.bar import Bar
import datetime

from detectron2.engine.hooks import HookBase
from detectron2.evaluation import inference_context
from detectron2.utils.logger import log_every_n_seconds
import detectron2.utils.comm as comm
import torch
import time
import logging

# import some common detectron2 utilities
from detectron2 import model_zoo
from detectron2.engine import DefaultPredictor, DefaultTrainer, launch, default_argument_parser, default_setup
from detectron2.config import get_cfg
from detectron2.utils.visualizer import Visualizer
from detectron2.data import DatasetMapper, MetadataCatalog, DatasetCatalog, build_detection_test_loader, build_detection_train_loader
import detectron2.data.transforms as T

# deeplab specific stuff
from detectron2.projects.deeplab import add_deeplab_config, build_lr_scheduler

from detectron2.evaluation import COCOEvaluator, inference_on_dataset

from detectron2.structures import BoxMode

from tools.darwin import *

categories = ["Background", "Tower foreground"]


def build_sem_seg_train_aug(cfg):
    augs = [
        T.ResizeShortestEdge(
            cfg.INPUT.MIN_SIZE_TRAIN, cfg.INPUT.MAX_SIZE_TRAIN, cfg.INPUT.MIN_SIZE_TRAIN_SAMPLING
        )
    ]
    if cfg.INPUT.CROP.ENABLED:
        augs.append(
            T.RandomCrop_CategoryAreaConstraint(
                cfg.INPUT.CROP.TYPE,
                cfg.INPUT.CROP.SIZE,
                cfg.INPUT.CROP.SINGLE_CATEGORY_MAX_AREA,
                cfg.MODEL.SEM_SEG_HEAD.IGNORE_VALUE,
            )
        )
    augs.append(T.RandomFlip())
    return augs


def setup(args):

    # expose GPUs 0 and 1 to this process
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

    # REGISTER DATASETS
    dataset_directory = "/home/ndserv05/Documents/Data/Tower_foreground"

    for d in ["train", "val"]:
        # get_darwin_dataset(dataset_directory, d)
        DatasetCatalog.register("tower_foreground_" + d, lambda d=d: get_darwin_dataset(dataset_directory, d, categories))
        MetadataCatalog.get("tower_foreground_" + d).set(thing_classes=categories)

    # CONFIGURATION
    cfg = get_cfg()
    add_deeplab_config(cfg)
    cfg.merge_from_file("./projects/DeepLab/configs/Cityscapes-SemanticSegmentation/deeplab_v3_plus_R_103_os16_mg124_poly_90k_bs16.yaml")
    cfg.OUTPUT_DIR = "./output/" + "Tower_foreground" + "{:%Y%m%dT%H%M}".format(datetime.datetime.now())
    cfg.DATASETS.TRAIN = ("tower_foreground_train",)
    cfg.DATASETS.TEST = ()
    cfg.MODEL.WEIGHTS = "model_final_a8a355.pkl"  # downloaded from https://github.com/facebookresearch/detectron2/tree/master/projects/DeepLab
    cfg.SOLVER.IMS_PER_BATCH = 2
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = 2
    cfg.MODEL.SEM_SEG_HEAD.NUM_CLASSES = 2

    cfg.freeze()
    default_setup(cfg, args)

    return cfg

# TRAINER
class myTrainer(DefaultTrainer):

    @classmethod
    def build_train_loader(cls, cfg):
        if "SemanticSegmentor" in cfg.MODEL.META_ARCHITECTURE:
            mapper = DatasetMapper(cfg, is_train=True, augmentations=build_sem_seg_train_aug(cfg))
        else:
            mapper = None
        return build_detection_train_loader(cfg, mapper=mapper)

    @classmethod
    def build_lr_scheduler(cls, cfg, optimizer):
        """
        It now calls :func:`detectron2.solver.build_lr_scheduler`.
        Overwrite it if you'd like a different scheduler.
        """
        return build_lr_scheduler(cfg, optimizer)


def main(args):

    cfg = setup(args)

    os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
    trainer = myTrainer(cfg)
    trainer.resume_or_load(resume=False)
    
    return trainer.train()

if __name__ == '__main__':
    args = default_argument_parser().parse_args()
    launch(
        main,
        2,
        num_machines=1,
        machine_rank=args.machine_rank,
        dist_url=args.dist_url,
        args=(args,),
    )

Expected behavior:

Training as usual.

Environment:

detectron2              0.3 @/home/ndserv05/Documents/Python/detectron2/detectron2
Compiler                GCC 7.5
CUDA compiler           CUDA 10.0
detectron2 arch flags   7.5
DETECTRON2_ENV_MODULE   <not set>
PyTorch                 1.7.1 @/home/ndserv05/.local/lib/python3.6/site-packages/torch
PyTorch debug build     False
GPU available           True
GPU 0,1                 GeForce RTX 2080 Ti (arch=7.5)
CUDA_HOME               /usr/local/cuda-10.0
Pillow                  8.1.0
torchvision             0.8.2 @/home/ndserv05/.local/lib/python3.6/site-packages/torchvision
torchvision arch flags  3.5, 5.0, 6.0, 7.0, 7.5
fvcore                  0.1.5.post20210423
cv2                     4.5.1
----------------------  --------------------------------------------------------------------
PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 10.2
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75
  - CuDNN 7.6.5
  - Magma 2.5.2
  - Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

Top GitHub Comments

dinisov commented, Jul 2, 2021

It turns out all the errors above are due to lack of memory.

dinisov commented, Jul 1, 2021

So I reduced the crop size to:

cfg.INPUT.CROP.SIZE = (64, 128)

And now it works. Is this expected? Does DeepLab really not fit into 11 GB of GPU memory with (256, 512) crops?
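
To put the two crop sizes side by side, here is a rough sketch, not a measurement. It assumes an output stride of 16 (read off `os16` in the config file name) and that the pooling kernel is the crop size divided by that stride. The helper names are hypothetical. The pixel ratio is only a first-order proxy for activation memory, which also depends on the backbone, batch size, and optimizer state.

```python
OUTPUT_STRIDE = 16  # assumption, taken from "os16" in the config file name


def pool_kernel_for_crop(crop_size, stride=OUTPUT_STRIDE):
    """Feature-map size (height, width) implied by a crop at the given stride."""
    h, w = crop_size
    if h % stride or w % stride:
        raise ValueError("crop size must be divisible by the stride")
    return (h // stride, w // stride)


def pixel_ratio(crop_a, crop_b):
    """How many times more pixels crop_a has than crop_b."""
    return (crop_a[0] * crop_a[1]) / (crop_b[0] * crop_b[1])


print(pool_kernel_for_crop((64, 128)))     # (4, 8)
print(pool_kernel_for_crop((256, 512)))    # (16, 32)
print(pixel_ratio((256, 512), (64, 128)))  # 16.0
```

Under the same assumptions, a (512, 1024) crop would give a (32, 64) kernel, which matches the `pool_kernel_size` reported in the error message.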
