Training my own dataset (4 classes): validation mAP = 0 even after more than 200 epochs
I am not sure this is "unexpected behavior". When training my own dataset from scratch, the losses printed on screen look sensible, but the mAP reported after each epoch is zero.
I have my own dataset of about 4200 training images and 600 validation images. Each image is 2000x2000 and contains around 30-40 objects. The ground truth was originally in VOC format and I converted it to COCO format; it appears to load correctly.
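As a quick check that the converted annotations really do load, something like this works (a sketch using pycocotools; the path is my train split, and the expected counts come from the numbers above):

```python
from pycocotools.coco import COCO

# Sketch: sanity-check the converted COCO-format json (train split).
ann_file = '/home/sergio/datasets/MyData/coco2000/annotations/ZFL_2000_train.json'
coco = COCO(ann_file)

cat_ids = coco.getCatIds()
print('category ids:', cat_ids)                                    # expect my 4 class ids
print('categories:', [c['name'] for c in coco.loadCats(cat_ids)])
print('images:', len(coco.getImgIds()))                            # ~4200 for train
print('annotations:', len(coco.getAnnIds()))                       # ~30-40 per image
```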
Instructions To Reproduce the Issue:
- what changes you made (`git diff`) or what code you wrote:

datasets/__init__.py:
```python
from .coco import build as build_coco
from .zfl import build as build_zfl
...

def build_dataset(image_set, args):
    print(f'build dataset {image_set}, {args.dataset_file}')
    if args.dataset_file == 'coco':
        return build_coco(image_set, args)
    if args.dataset_file == 'coco_panoptic':
        # to avoid making panopticapi required for coco
        from .coco_panoptic import build as build_coco_panoptic
        return build_coco_panoptic(image_set, args)
    if args.dataset_file == 'ZFL_2000':  # SAV added
        return build_zfl(image_set, args)
    raise ValueError(f'dataset {args.dataset_file} not supported')
```
datasets/zfl.py (copied from coco.py):

```python
# Also changed max_size=1333 to max_size=800 as I am using a single GPU.
# build() has slightly different paths.
def build(image_set, args):
    print(f'build ZFL {image_set} {args.coco_path}')
    root = Path(args.coco_path)
    assert root.exists(), f'provided COCO path {root} does not exist'
    mode = 'ZFL_2000'
    PATHS = {
        "train": (root / "train", root / "annotations" / f'{mode}_train.json'),
        "val": (root / "val", root / "annotations" / f'{mode}_val.json'),
    }
    img_folder, ann_file = PATHS[image_set]
    print(f'build {PATHS}')
    print(f'img_folder: {img_folder} ann_file: {ann_file}')
    dataset = CocoDetection(img_folder, ann_file,
                            transforms=make_coco_transforms(image_set),
                            return_masks=args.masks)
    print(type(dataset))
    return dataset
```
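A quick way to spot-check what this dataset returns (a sketch, run from the detr repo root; `Args` is a hypothetical stand-in for the real argparse namespace):

```python
# Sketch: inspect one sample from the custom dataset.
from datasets.zfl import build

class Args:                  # minimal stand-in for the argparse namespace
    coco_path = '/home/sergio/datasets/MyData/coco2000'
    masks = False

ds = build('train', Args())
img, target = ds[0]
print(img.shape)             # C x H x W tensor after the resize/normalize transforms
print(target['labels'])      # category ids as used in the json
print(target['boxes'][:5])   # normalized (cx, cy, w, h) boxes
```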
- what exact command you run:

```
python main.py --coco_path /home/sergio/datasets/MyData/coco2000 --dataset_file ZFL_2000 --num_classes 5 --output_dir /home/sergio/MyResults/detr/ZFL2000
```

- what you observed (including full logs):
{"train_lr": 9.999999999999674e-06, "train_class_error": 4.398355509734108, "train_loss": 9.542255433653132, "train_loss_ce": 0.4571945381589478, "train_loss_bbox": 0.2981324203542087, "train_loss_giou": 0.7442877022531565, "train_loss_ce_0": 0.6183023796145857, "train_loss_bbox_0": 0.500594801518901, "train_loss_giou_0": 1.0607733845222662, "train_loss_ce_1": 0.4665685473494447, "train_loss_bbox_1": 0.2854071630322893, "train_loss_giou_1": 0.7255790654232507, "train_loss_ce_2": 0.4716044566620981, "train_loss_bbox_2": 0.27855126015386167, "train_loss_giou_2": 0.7001376669164855, "train_loss_ce_3": 0.45213553895150993, "train_loss_bbox_3": 0.28912976259807116, "train_loss_giou_3": 0.7194564093177087, "train_loss_ce_4": 0.4742862869176102, "train_loss_bbox_4": 0.2832668003391094, "train_loss_giou_4": 0.7168472326739274, "train_loss_ce_unscaled": 0.4571945381589478, "train_class_error_unscaled": 4.398355509734108, "train_loss_bbox_unscaled": 0.05962648409372106, "train_loss_giou_unscaled": 0.3721438511265783, "train_cardinality_error_unscaled": 53.24012524084778, "train_loss_ce_0_unscaled": 0.6183023796145857, "train_loss_bbox_0_unscaled": 0.10011896042705931, "train_loss_giou_0_unscaled": 0.5303866922611331, "train_cardinality_error_0_unscaled": 56.17413294797688, "train_loss_ce_1_unscaled": 0.4665685473494447, "train_loss_bbox_1_unscaled": 0.05708143253396184, "train_loss_giou_1_unscaled": 0.36278953271162534, "train_cardinality_error_1_unscaled": 53.827071290944126, "train_loss_ce_2_unscaled": 0.4716044566620981, "train_loss_bbox_2_unscaled": 0.05571025212341112, "train_loss_giou_2_unscaled": 0.35006883345824275, "train_cardinality_error_2_unscaled": 55.779865125240846, "train_loss_ce_3_unscaled": 0.45213553895150993, "train_loss_bbox_3_unscaled": 0.057825952427827815, "train_loss_giou_3_unscaled": 0.35972820465885436, "train_cardinality_error_3_unscaled": 52.947976878612714, "train_loss_ce_4_unscaled": 0.4742862869176102, "train_loss_bbox_4_unscaled": 0.05665336007939612, "train_loss_giou_4_unscaled": 0.3584236163369637, "train_cardinality_error_4_unscaled": 54.62042389210019, "test_class_error": 100.0, "test_loss": 57.28199487262302, "test_loss_ce": 1.777878213259909, "test_loss_bbox": 3.4779871246880956, "test_loss_giou": 3.7242077779438763, "test_loss_ce_0": 1.9692318725089233, "test_loss_bbox_0": 4.411114124788178, "test_loss_giou_0": 3.633108110891448, "test_loss_ce_1": 1.8099430588384469, "test_loss_bbox_1": 3.8204284417960377, "test_loss_giou_1": 3.6712065835793815, "test_loss_ce_2": 2.085218966835075, "test_loss_bbox_2": 3.657225708166758, "test_loss_giou_2": 3.625376251836618, "test_loss_ce_3": 2.8818341568112373, "test_loss_bbox_3": 3.8566246471471257, "test_loss_giou_3": 3.689193084008164, "test_loss_ce_4": 1.9929687645700243, "test_loss_bbox_4": 3.47685175223483, "test_loss_giou_4": 3.721596374279923, "test_loss_ce_unscaled": 1.777878213259909, "test_class_error_unscaled": 100.0, "test_loss_bbox_unscaled": 0.6955974255171087, "test_loss_giou_unscaled": 1.8621038889719381, "test_cardinality_error_unscaled": 45.11805555555556, "test_loss_ce_0_unscaled": 1.9692318725089233, "test_loss_bbox_0_unscaled": 0.8822228231777748, "test_loss_giou_0_unscaled": 1.816554055445724, "test_cardinality_error_0_unscaled": 54.920138888888886, "test_loss_ce_1_unscaled": 1.8099430588384469, "test_loss_bbox_1_unscaled": 0.7640856887317367, "test_loss_giou_1_unscaled": 1.8356032917896907, "test_cardinality_error_1_unscaled": 45.11805555555556, "test_loss_ce_2_unscaled": 2.085218966835075, 
"test_loss_bbox_2_unscaled": 0.7314451411366463, "test_loss_giou_2_unscaled": 1.812688125918309, "test_cardinality_error_2_unscaled": 45.11805555555556, "test_loss_ce_3_unscaled": 2.8818341568112373, "test_loss_bbox_3_unscaled": 0.7713249280220933, "test_loss_giou_3_unscaled": 1.844596542004082, "test_cardinality_error_3_unscaled": 45.11805555555556, "test_loss_ce_4_unscaled": 1.9929687645700243, "test_loss_bbox_4_unscaled": 0.6953703521026505, "test_loss_giou_4_unscaled": 1.8607981871399615, "test_cardinality_error_4_unscaled": 45.11805555555556, "test_coco_eval_bbox": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], "epoch": 290, "n_parameters": 41280266}
Expected behavior:
I was expecting non-zero mAP values on the validation set.
Environment:
Collected with `python -m torch.utils.collect_env`:
PyTorch version: 1.9.0
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A
OS: Linux Mint 19 Tara (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.10.2
Libc version: glibc-2.17
Python version: 3.7 (64-bit runtime)
Python platform: Linux-4.15.0-163-generic-x86_64-with-debian-buster-sid
Is CUDA available: True
CUDA runtime version: 9.1.85
GPU models and configuration: GPU 0: TITAN Xp COLLECTORS EDITION
Nvidia driver version: 460.32.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.2.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.2.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.2.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.2.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.2.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.2.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.2.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.20.3
[pip3] torch==1.9.0
[pip3] torchvision==0.10.0
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.2.89 hfd86e86_1
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.3.0 h06a4308_520
[conda] mkl-service 2.4.0 py37h7f8727e_0
[conda] mkl_fft 1.3.0 py37h42c9631_2
[conda] mkl_random 1.2.2 py37h51133e4_0
[conda] numpy 1.20.3 py37hf144106_0
[conda] numpy-base 1.20.3 py37h74d4b33_0
[conda] pytorch 1.9.0 py3.7_cuda10.2_cudnn7.6.5_0 pytorch
[conda] torchvision 0.10.0 py37_cu102 pytorch
Top GitHub Comments
Question: my category IDs start from 0 and there are 12 classes in total (i.e. IDs 0-11). When setting num_classes, should it be 12 or 13? Thanks.
(In case you haven't solved this yet) You could look at the value of num_classes in the build(args) function in detr.py. The authors left a comment there: these values all need to be the real number of classes plus 1. My mAP also stayed at 0 during training, and after changing this it was no longer 0.
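For reference, the spot that comment points to looks roughly like this (a sketch of `build(args)` in models/detr.py; the `ZFL_2000` branch is illustrative for this issue, not verbatim upstream code):

```python
# models/detr.py, inside build(args) -- sketch, not verbatim upstream code.
# DETR's `num_classes` is misleadingly named: it must be max_obj_id + 1,
# where max_obj_id is the highest category id appearing in the annotations.
#   4 classes with ids 1..4   -> num_classes = 5
#   12 classes with ids 0..11 -> num_classes = 12
num_classes = 20 if args.dataset_file != 'coco' else 91
if args.dataset_file == 'ZFL_2000':   # illustrative branch for this issue
    num_classes = 5                   # ids 1..4, so max_obj_id + 1 = 5
```

By that rule, the question above (IDs 0-11) would need num_classes = 12, since the highest id is 11.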