
Training my own dataset (4 classes): validation mAP = 0 even after more than 200 epochs


I am not sure this is “unexpected behavior”. When training my own dataset from scratch, I see what seem to be sensible loss values on screen, but the mAP is zero after every epoch.

I have my own dataset of about 4200 training images and 600 validation images. Each image is 2000x2000 pixels and contains around 30-40 objects. The ground truth was originally in VOC format and I converted it to COCO format; it seems to be read OK (see the sanity check below).
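
As a quick check that the converted annotations load, something like this works (a minimal sketch assuming pycocotools is installed; the annotation path is hypothetical):

from pycocotools.coco import COCO

# Load the converted annotation file and inspect categories and counts.
coco = COCO('annotations/ZFL_2000_train.json')  # hypothetical path
print('categories:', coco.loadCats(coco.getCatIds()))
print('images:', len(coco.imgs), 'annotations:', len(coco.anns))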

Instructions To Reproduce the Issue:

  1. what changes you made (git diff) or what code you wrote
datasets/__init__.py:
from .coco import build as build_coco
from .zfl import build as build_zfl
...
def build_dataset(image_set, args):
    print(f'build dataset {image_set}, {args.dataset_file}')
          
    if args.dataset_file == 'coco':
        return build_coco(image_set, args)
    if args.dataset_file == 'coco_panoptic':
        # to avoid making panopticapi required for coco
        from .coco_panoptic import build as build_coco_panoptic
        return build_coco_panoptic(image_set, args)
    if args.dataset_file == 'ZFL_2000':  # SAV added
        return build_zfl(image_set, args)
    raise ValueError(f'dataset {args.dataset_file} not supported')

datasets/zfl.py (copied from coco.py, so Path, CocoDetection and make_coco_transforms are already defined in the file):
# Also changed max_size=1333 to max_size=800 in make_coco_transforms, as I am using a single GPU
# build() uses slightly different paths
def build(image_set, args):
    print(f'build ZFL {image_set} {args.coco_path}')
    root = Path(args.coco_path)
    
    assert root.exists(), f'provided COCO path {root} does not exist'
    mode = 'ZFL_2000'
    PATHS = {
        "train": (root / "train", root / "annotations" / f'{mode}_train.json'),
        "val": (root / "val", root / "annotations" / f'{mode}_val.json'),
    }

    img_folder, ann_file = PATHS[image_set]
    
    print(f'build {PATHS}')
    print(f'img_folder: {img_folder} ann_file: {ann_file}')
    dataset = CocoDetection(img_folder, ann_file, transforms=make_coco_transforms(image_set), return_masks=args.masks)
    print(type(dataset))
    return dataset
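
As a smoke test of the new dataset path (not part of the diff; a hypothetical snippet relying on the 'labels'/'boxes' keys DETR's CocoDetection puts in each target):

from argparse import Namespace
from datasets import build_dataset

# Stand-in for the parsed command-line args; paths as in the command below.
args = Namespace(dataset_file='ZFL_2000',
                 coco_path='/home/sergio/datasets/MyData/coco2000',
                 masks=False)
ds = build_dataset('train', args)
img, target = ds[0]
print(target['labels'])        # class ids; should all lie in 1..4 here
print(target['boxes'].shape)   # (num_objects, 4)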
  2. what exact command you run:

python main.py --coco_path /home/sergio/datasets/MyData/coco2000 --dataset_file ZFL_2000 --num_classes 5 --output_dir /home/sergio/MyResults/detr/ZFL2000

  3. what you observed (including full logs):

{"train_lr": 9.999999999999674e-06, "train_class_error": 4.398355509734108, "train_loss": 9.542255433653132, "train_loss_ce": 0.4571945381589478, "train_loss_bbox": 0.2981324203542087, "train_loss_giou": 0.7442877022531565, "train_loss_ce_0": 0.6183023796145857, "train_loss_bbox_0": 0.500594801518901, "train_loss_giou_0": 1.0607733845222662, "train_loss_ce_1": 0.4665685473494447, "train_loss_bbox_1": 0.2854071630322893, "train_loss_giou_1": 0.7255790654232507, "train_loss_ce_2": 0.4716044566620981, "train_loss_bbox_2": 0.27855126015386167, "train_loss_giou_2": 0.7001376669164855, "train_loss_ce_3": 0.45213553895150993, "train_loss_bbox_3": 0.28912976259807116, "train_loss_giou_3": 0.7194564093177087, "train_loss_ce_4": 0.4742862869176102, "train_loss_bbox_4": 0.2832668003391094, "train_loss_giou_4": 0.7168472326739274, "train_loss_ce_unscaled": 0.4571945381589478, "train_class_error_unscaled": 4.398355509734108, "train_loss_bbox_unscaled": 0.05962648409372106, "train_loss_giou_unscaled": 0.3721438511265783, "train_cardinality_error_unscaled": 53.24012524084778, "train_loss_ce_0_unscaled": 0.6183023796145857, "train_loss_bbox_0_unscaled": 0.10011896042705931, "train_loss_giou_0_unscaled": 0.5303866922611331, "train_cardinality_error_0_unscaled": 56.17413294797688, "train_loss_ce_1_unscaled": 0.4665685473494447, "train_loss_bbox_1_unscaled": 0.05708143253396184, "train_loss_giou_1_unscaled": 0.36278953271162534, "train_cardinality_error_1_unscaled": 53.827071290944126, "train_loss_ce_2_unscaled": 0.4716044566620981, "train_loss_bbox_2_unscaled": 0.05571025212341112, "train_loss_giou_2_unscaled": 0.35006883345824275, "train_cardinality_error_2_unscaled": 55.779865125240846, "train_loss_ce_3_unscaled": 0.45213553895150993, "train_loss_bbox_3_unscaled": 0.057825952427827815, "train_loss_giou_3_unscaled": 0.35972820465885436, "train_cardinality_error_3_unscaled": 52.947976878612714, "train_loss_ce_4_unscaled": 0.4742862869176102, "train_loss_bbox_4_unscaled": 0.05665336007939612, "train_loss_giou_4_unscaled": 0.3584236163369637, "train_cardinality_error_4_unscaled": 54.62042389210019, "test_class_error": 100.0, "test_loss": 57.28199487262302, "test_loss_ce": 1.777878213259909, "test_loss_bbox": 3.4779871246880956, "test_loss_giou": 3.7242077779438763, "test_loss_ce_0": 1.9692318725089233, "test_loss_bbox_0": 4.411114124788178, "test_loss_giou_0": 3.633108110891448, "test_loss_ce_1": 1.8099430588384469, "test_loss_bbox_1": 3.8204284417960377, "test_loss_giou_1": 3.6712065835793815, "test_loss_ce_2": 2.085218966835075, "test_loss_bbox_2": 3.657225708166758, "test_loss_giou_2": 3.625376251836618, "test_loss_ce_3": 2.8818341568112373, "test_loss_bbox_3": 3.8566246471471257, "test_loss_giou_3": 3.689193084008164, "test_loss_ce_4": 1.9929687645700243, "test_loss_bbox_4": 3.47685175223483, "test_loss_giou_4": 3.721596374279923, "test_loss_ce_unscaled": 1.777878213259909, "test_class_error_unscaled": 100.0, "test_loss_bbox_unscaled": 0.6955974255171087, "test_loss_giou_unscaled": 1.8621038889719381, "test_cardinality_error_unscaled": 45.11805555555556, "test_loss_ce_0_unscaled": 1.9692318725089233, "test_loss_bbox_0_unscaled": 0.8822228231777748, "test_loss_giou_0_unscaled": 1.816554055445724, "test_cardinality_error_0_unscaled": 54.920138888888886, "test_loss_ce_1_unscaled": 1.8099430588384469, "test_loss_bbox_1_unscaled": 0.7640856887317367, "test_loss_giou_1_unscaled": 1.8356032917896907, "test_cardinality_error_1_unscaled": 45.11805555555556, "test_loss_ce_2_unscaled": 2.085218966835075, 
"test_loss_bbox_2_unscaled": 0.7314451411366463, "test_loss_giou_2_unscaled": 1.812688125918309, "test_cardinality_error_2_unscaled": 45.11805555555556, "test_loss_ce_3_unscaled": 2.8818341568112373, "test_loss_bbox_3_unscaled": 0.7713249280220933, "test_loss_giou_3_unscaled": 1.844596542004082, "test_cardinality_error_3_unscaled": 45.11805555555556, "test_loss_ce_4_unscaled": 1.9929687645700243, "test_loss_bbox_4_unscaled": 0.6953703521026505, "test_loss_giou_4_unscaled": 1.8607981871399615, "test_cardinality_error_4_unscaled": 45.11805555555556, "test_coco_eval_bbox": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], "epoch": 290, "n_parameters": 41280266}
  4. please simplify the steps as much as possible so they do not require additional resources to run, such as a private dataset.

Expected behavior:

I was expecting non-zero values for mAP; instead test_coco_eval_bbox stays at all zeros after every epoch.
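
A quick way to confirm this across the whole run (a sketch; DETR's main.py appends one JSON dict per epoch to log.txt under --output_dir):

import json

# Print epoch, AP@[.50:.95] (first entry of test_coco_eval_bbox) and class error.
with open('/home/sergio/MyResults/detr/ZFL2000/log.txt') as f:
    for line in f:
        stats = json.loads(line)
        print(stats['epoch'], stats['test_coco_eval_bbox'][0], stats['test_class_error'])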

Environment:

Provide your environment information using the following command:

python -m torch.utils.collect_env

PyTorch version: 1.9.0
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Linux Mint 19 Tara (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.10.2
Libc version: glibc-2.17

Python version: 3.7 (64-bit runtime)
Python platform: Linux-4.15.0-163-generic-x86_64-with-debian-buster-sid
Is CUDA available: True
CUDA runtime version: 9.1.85
GPU models and configuration: GPU 0: TITAN Xp COLLECTORS EDITION
Nvidia driver version: 460.32.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.2.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.2.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.2.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.2.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.2.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.2.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.2.0
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.3
[pip3] torch==1.9.0
[pip3] torchvision==0.10.0
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               10.2.89              hfd86e86_1  
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2021.3.0           h06a4308_520  
[conda] mkl-service               2.4.0            py37h7f8727e_0  
[conda] mkl_fft                   1.3.0            py37h42c9631_2  
[conda] mkl_random                1.2.2            py37h51133e4_0  
[conda] numpy                     1.20.3           py37hf144106_0  
[conda] numpy-base                1.20.3           py37h74d4b33_0  
[conda] pytorch                   1.9.0           py3.7_cuda10.2_cudnn7.6.5_0    pytorch
[conda] torchvision               0.10.0               py37_cu102    pytorch

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 7

Top GitHub Comments

2 reactions
C1078617739 commented, Nov 17, 2022

Excuse me, my class ids start from 0 and there are 12 classes in total (i.e. ids 0-11). When setting num_classes, should it be 12 or 13? Thanks.

1 reaction
Moqixis commented, May 30, 2022

(If you haven't solved it yet,) you could take a look at the value of num_classes in the build(args) function in detr.py. The author left a comment there: these values all need to be the real number of classes plus 1. My mAP was also always 0 during training; after I changed it, the mAP was basically no longer 0.
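
For reference, the comment they mention is in build(args) in models/detr.py upstream, which says num_classes really means max_obj_id + 1. A minimal sketch of the suggested change (the ZFL_2000 branch is a hypothetical addition, assuming class ids 1-4):

# models/detr.py (upstream DETR), abridged
def build(args):
    # Upstream's comment: num_classes is max_obj_id + 1, where max_obj_id is
    # the highest class id appearing in the annotations.
    num_classes = 20 if args.dataset_file != 'coco' else 91
    if args.dataset_file == 'coco_panoptic':
        num_classes = 250
    if args.dataset_file == 'ZFL_2000':  # hypothetical: ids 1..4 -> 4 + 1 = 5
        num_classes = 5
    ...

By that rule, the 0-11 id scheme in the question above would give num_classes = 11 + 1 = 12.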
