Error when training DeepLabV3+
Hello, I get the following error when attempting to train a DeepLabV3+ model for simple foreground/background segmentation. The model is fed 1024x1024 images, and the annotations are as described in the documentation, i.e. single-channel images with 0 for one category and 1 for the other. The machine has two RTX 2080 Ti GPUs.
File "/home/ndserv05/Documents/Python/detectron2/detectron2/layers/aspp.py", line 135, in forward
"Input size: {} `pool_kernel_size`: {}".format(size, self.pool_kernel_size)
ValueError: `pool_kernel_size` must be divisible by the shape of inputs. Input size: torch.Size([32, 32]) `pool_kernel_size`: (32, 64)
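The numbers in this message are consistent with a pooling kernel derived from the training crop size rather than from the actual input. As a minimal sketch, assuming the ASPP layer requires each feature-map dimension to be a multiple of the matching `pool_kernel_size` dimension, and assuming the kernel is the crop size divided by the encoder output stride of 16 (the `os16` in the config name), the check can be reproduced in plain Python. The helper names below are hypothetical and are not part of detectron2; the (512, 1024) crop is an assumed default of the Cityscapes config.

```python
# Hypothetical helpers illustrating the divisibility check; not detectron2 API.

def pool_kernel_from_crop(crop_size, output_stride=16):
    """Assumed relation: pooling kernel = crop size // encoder output stride."""
    crop_h, crop_w = crop_size
    return (crop_h // output_stride, crop_w // output_stride)


def is_divisible(feature_size, pool_kernel_size):
    """True when each feature-map dimension is a multiple of the kernel dimension."""
    return all(f % k == 0 for f, k in zip(feature_size, pool_kernel_size))


# A (512, 1024) crop (assumed config default) gives the kernel from the error.
kernel = pool_kernel_from_crop((512, 1024))
print(kernel)                          # (32, 64)
print(is_divisible((32, 32), kernel))  # False: 32 is not a multiple of 64
print(is_divisible((32, 64), kernel))  # True
```

Under these assumptions, a 1024x1024 image whose shortest edge is resized to 512 can yield at most a 512x512 crop, hence a 32x32 feature map, which fails against a (32, 64) kernel. A square `cfg.INPUT.CROP.SIZE` that the resized images can fill would be one way to avoid the mismatch.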
Instructions To Reproduce the Issue:
import detectron2
from detectron2.utils.logger import setup_logger
setup_logger()
# import some common libraries
import numpy as np
import os, json, cv2, random
import pycocotools
import skimage.draw
from PIL import Image, ImageDraw
from progress.bar import Bar
import datetime
from detectron2.engine.hooks import HookBase
from detectron2.evaluation import inference_context
from detectron2.utils.logger import log_every_n_seconds
import detectron2.utils.comm as comm
import torch
import time
import logging
# import some common detectron2 utilities
from detectron2 import model_zoo
from detectron2.engine import DefaultPredictor, DefaultTrainer, launch, default_argument_parser, default_setup
from detectron2.config import get_cfg
from detectron2.utils.visualizer import Visualizer
from detectron2.data import DatasetMapper, MetadataCatalog, DatasetCatalog, build_detection_test_loader, build_detection_train_loader
import detectron2.data.transforms as T
# deeplab specific stuff
from detectron2.projects.deeplab import add_deeplab_config, build_lr_scheduler
from detectron2.evaluation import COCOEvaluator, inference_on_dataset
from detectron2.structures import BoxMode
from tools.darwin import *
categories = ["Background", "Tower foreground"]
def build_sem_seg_train_aug(cfg):
    augs = [
        T.ResizeShortestEdge(
            cfg.INPUT.MIN_SIZE_TRAIN, cfg.INPUT.MAX_SIZE_TRAIN, cfg.INPUT.MIN_SIZE_TRAIN_SAMPLING
        )
    ]
    if cfg.INPUT.CROP.ENABLED:
        augs.append(
            T.RandomCrop_CategoryAreaConstraint(
                cfg.INPUT.CROP.TYPE,
                cfg.INPUT.CROP.SIZE,
                cfg.INPUT.CROP.SINGLE_CATEGORY_MAX_AREA,
                cfg.MODEL.SEM_SEG_HEAD.IGNORE_VALUE,
            )
        )
    augs.append(T.RandomFlip())
    return augs
def setup(args):
    # set the number of GPUs
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
    # REGISTER DATASETS
    dataset_directory = "/home/ndserv05/Documents/Data/Tower_foreground"
    for d in ["train", "val"]:
        # get_darwin_dataset(dataset_directory, d)
        DatasetCatalog.register("tower_foreground_" + d, lambda d=d: get_darwin_dataset(dataset_directory, d, categories))
        MetadataCatalog.get("tower_foreground_" + d).set(thing_classes=categories)
    # CONFIGURATION
    cfg = get_cfg()
    add_deeplab_config(cfg)
    cfg.merge_from_file("./projects/DeepLab/configs/Cityscapes-SemanticSegmentation/deeplab_v3_plus_R_103_os16_mg124_poly_90k_bs16.yaml")
    cfg.OUTPUT_DIR = "./output/" + "Tower_foreground" + "{:%Y%m%dT%H%M}".format(datetime.datetime.now())
    cfg.DATASETS.TRAIN = ("tower_foreground_train",)
    cfg.DATASETS.TEST = ()
    cfg.MODEL.WEIGHTS = "model_final_a8a355.pkl"  # downloaded from https://github.com/facebookresearch/detectron2/tree/master/projects/DeepLab
    cfg.SOLVER.IMS_PER_BATCH = 2
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = 2
    cfg.MODEL.SEM_SEG_HEAD.NUM_CLASSES = 2
    cfg.freeze()
    default_setup(cfg, args)
    return cfg
# TRAINER
class myTrainer(DefaultTrainer):
    @classmethod
    def build_train_loader(cls, cfg):
        if "SemanticSegmentor" in cfg.MODEL.META_ARCHITECTURE:
            mapper = DatasetMapper(cfg, is_train=True, augmentations=build_sem_seg_train_aug(cfg))
        else:
            mapper = None
        return build_detection_train_loader(cfg, mapper=mapper)

    @classmethod
    def build_lr_scheduler(cls, cfg, optimizer):
        """
        It now calls :func:`detectron2.solver.build_lr_scheduler`.
        Overwrite it if you'd like a different scheduler.
        """
        return build_lr_scheduler(cfg, optimizer)
def main(args):
    cfg = setup(args)
    os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
    trainer = myTrainer(cfg)
    trainer.resume_or_load(resume=False)
    return trainer.train()
if __name__ == '__main__':
    args = default_argument_parser().parse_args()
    launch(
        main,
        2,
        num_machines=1,
        machine_rank=args.machine_rank,
        dist_url=args.dist_url,
        args=(args,),
    )
Expected behavior:
Training as usual.
Environment:
detectron2 0.3 @/home/ndserv05/Documents/Python/detectron2/detectron2
Compiler GCC 7.5
CUDA compiler CUDA 10.0
detectron2 arch flags 7.5
DETECTRON2_ENV_MODULE <not set>
PyTorch 1.7.1 @/home/ndserv05/.local/lib/python3.6/site-packages/torch
PyTorch debug build False
GPU available True
GPU 0,1 GeForce RTX 2080 Ti (arch=7.5)
CUDA_HOME /usr/local/cuda-10.0
Pillow 8.1.0
torchvision 0.8.2 @/home/ndserv05/.local/lib/python3.6/site-packages/torchvision
torchvision arch flags 3.5, 5.0, 6.0, 7.0, 7.5
fvcore 0.1.5.post20210423
cv2 4.5.1
---------------------- --------------------------------------------------------------------
PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 10.2
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75
- CuDNN 7.6.5
- Magma 2.5.2
- Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
Top GitHub Comments
It turns out all the errors above are due to lack of memory.
So I reduced the crop size to:
cfg.INPUT.CROP.SIZE = (64, 128)
And now it works. Is this expected? Does DeepLab really not fit into 11 GB of GPU memory with a crop size of (256, 512)?
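As a rough way to see why the smaller crop helps: activation memory in a convolutional network grows roughly in proportion to the number of input pixels per image. This is an approximation that ignores weights, optimizer state, and framework overhead. A minimal sketch comparing the two crop sizes mentioned above, with a hypothetical helper name:

```python
def pixel_ratio(size_a, size_b):
    """Ratio of pixel counts between two (height, width) crop sizes."""
    return (size_a[0] * size_a[1]) / (size_b[0] * size_b[1])


# Compare the crop that ran out of memory with the one that worked.
print(pixel_ratio((256, 512), (64, 128)))  # 16.0
```

So the (64, 128) crop feeds 16 times fewer pixels per image than (256, 512). Note that the `cfg.INPUT.CROP.SIZE` override has to be applied before `cfg.freeze()` is called in `setup`.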