Validation during training failed

Hi everyone! 😃

I’m facing an error during training which is hard for me to debug.

While reading this report, please keep in mind that I’m still new to this project and learning as I go; I may have skipped something trivial, or my whole approach to the problem may be wrong.

The structure of this report is:

💼 What am I trying to do?
🔴 What is the unexpected error?
🔧 What I think might be the problem?
🧠 Final thoughts

Checklist

I have searched related issues but cannot get the expected help.

💼 What am I trying to do?

I am trying to use this toolbox for my master’s thesis. I have a dataset of myself interacting alone in an environment, and my goal is to perform simple action recognition. A pre-built dataset similar to mine is the Something-Something dataset. Because my labels differ from those of any existing dataset, I have to train a model on my custom dataset with its own label_map.txt and annotation files.
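As a rough illustration of the annotation files mentioned above, the sketch below writes a rawframe list, assuming the format described in MMAction2's documentation: one line per clip with the frame directory, the total number of frames, and a zero-based label, separated by spaces. The function name `build_rawframe_list`, the action names `pick` and `place`, the `<action>_<index>` folder naming, and the `img_00001.jpg` file pattern are all hypothetical choices for the demo, not details taken from the issue.

```python
import tempfile
from pathlib import Path


def build_rawframe_list(frame_root, label_map, out_file, pattern='*.jpg'):
    """Write one 'frame_dir num_frames label' line per clip folder."""
    lines = []
    for clip_dir in sorted(Path(frame_root).iterdir()):
        if not clip_dir.is_dir():
            continue
        # Hypothetical naming rule: '<action>_<index>', e.g. 'pick_01'.
        action = clip_dir.name.rsplit('_', 1)[0]
        num_frames = len(list(clip_dir.glob(pattern)))
        lines.append(f'{clip_dir.name} {num_frames} {label_map[action]}')
    Path(out_file).write_text('\n'.join(lines) + '\n')
    return lines


# Demo on a throwaway directory tree filled with empty placeholder frames.
with tempfile.TemporaryDirectory() as tmp:
    for clip, n in [('pick_01', 3), ('place_01', 2)]:
        clip_dir = Path(tmp) / 'frames' / clip
        clip_dir.mkdir(parents=True)
        for i in range(1, n + 1):
            (clip_dir / f'img_{i:05}.jpg').touch()
    demo_lines = build_rawframe_list(
        Path(tmp) / 'frames', {'pick': 0, 'place': 1},
        Path(tmp) / 'demo_list.txt')

print(demo_lines)  # ['pick_01 3 0', 'place_01 2 1']
```

Keeping the labels zero-based matters here: with num_classes=2 the only valid labels are 0 and 1.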

I started with the TSN model because it was the one for which I found the most documentation. I took the config made for training on the Kinetics 400 dataset and modified it to train on my custom dataset, like so:

# model settings
model = dict(  
    type='Recognizer2D', 
    backbone=dict(  
        type='ResNet',  
        pretrained='torchvision://resnet50',  
        depth=50, 
        norm_eval=False),  
    cls_head=dict( 
        type='TSNHead', 
        num_classes=2, 
        in_channels=2048,  
        spatial_type='avg',  
        consensus=dict(type='AvgConsensus', dim=1), 
        dropout_ratio=0.4, 
        init_std=0.01), 
    # model training and testing settings
    train_cfg=None,
    test_cfg=dict(average_clips=None))

# dataset settings
dataset_type = 'RawframeDataset'
data_root = '/content/drive/MyDrive/colab_master_thesis/test/'  
data_root_val = '/content/drive/MyDrive/colab_master_thesis/test/' 
ann_file_train = '/content/drive/MyDrive/colab_master_thesis/gf_train_list_rawframes.txt'  
ann_file_val = '/content/drive/MyDrive/colab_master_thesis/gf_val_list_rawframes.txt'  
ann_file_test = '/content/drive/MyDrive/colab_master_thesis/gf_val_list_rawframes.txt' 
img_norm_cfg = dict(mean=[123.675, 116.28, 103.53],  std=[58.395, 57.12, 57.375], to_bgr=False) 

train_pipeline = [  
    dict(type='SampleFrames', clip_len=1, frame_interval=1, num_clips=3),  
    dict(type='RawFrameDecode'),  
    dict(type='Resize', scale=(-1, 256)), 
    dict(type='MultiScaleCrop', input_size=224, scales=(1, 0.875, 0.75, 0.66), random_crop=False, max_wh_scale_gap=1), 
    dict(type='Resize', scale=(224, 224), keep_ratio=False),  
    dict(type='Flip', flip_ratio=0.5),  
    dict(type='Normalize', **img_norm_cfg),  
    dict(type='FormatShape', input_format='NCHW'),  
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),  
    dict(type='ToTensor', keys=['imgs', 'label'])  
]
val_pipeline = [  
    dict(type='SampleFrames', clip_len=1, frame_interval=1, num_clips=3, test_mode=True), 
    dict(type='RawFrameDecode'),
    dict(type='Resize', scale=(-1, 256)), 
    dict(type='CenterCrop',  crop_size=224),  
    dict(type='Flip', flip_ratio=0),  
    dict(type='Normalize', **img_norm_cfg), 
    dict(type='FormatShape', input_format='NCHW'), 
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]), 
    dict(type='ToTensor', keys=['imgs'])
]
test_pipeline = [  
    dict(type='SampleFrames', clip_len=1, frame_interval=1, num_clips=25, test_mode=True),  
    dict(type='RawFrameDecode'), 
    dict(type='Resize', scale=(-1, 256)),  
    dict(type='TenCrop', crop_size=224),  
    dict(type='Flip', flip_ratio=0),  
    dict(type='Normalize', **img_norm_cfg),  
    dict(type='FormatShape', input_format='NCHW'),  
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),  
    dict(type='ToTensor', keys=['imgs'])
]
data = dict(  
    videos_per_gpu=32, 
    workers_per_gpu=2,  
    train_dataloader=dict(drop_last=True),  
    val_dataloader=dict(videos_per_gpu=1),  
    test_dataloader=dict(videos_per_gpu=2),  
    train=dict(  
        type=dataset_type,
        ann_file=ann_file_train,
        data_prefix=data_root,
        pipeline=train_pipeline),
    val=dict(  
        type=dataset_type,
        ann_file=ann_file_val,
        data_prefix=data_root_val,
        pipeline=val_pipeline),
    test=dict(  
        type=dataset_type,
        ann_file=ann_file_test,
        data_prefix=data_root_val,
        pipeline=test_pipeline))

# optimizer
optimizer = dict(
    type='SGD',  
    lr=0.01, 
    momentum=0.9, 
    weight_decay=0.0001)  
optimizer_config = dict(  
    grad_clip=dict(max_norm=40, norm_type=2)) 

# learning policy
lr_config = dict(  
    policy='step',  
    step=[40, 80])  
total_epochs = 100  
checkpoint_config = dict(interval=5)  
evaluation = dict(  
    interval=5,  
    metrics=['top_k_accuracy', 'mean_class_accuracy'],  
    metric_options=dict(top_k_accuracy=dict(topk=(1, 3))), 
    save_best='top_k_accuracy')  
eval_config = dict(
    metric_options=dict(top_k_accuracy=dict(topk=(1, 3))))
log_config = dict(  
    interval=20,  
    hooks=[  
        dict(type='TextLoggerHook'),  
        # dict(type='TensorboardLoggerHook'),  # The Tensorboard logger is also supported
    ])

# runtime settings
dist_params = dict(backend='nccl') 
log_level = 'INFO' 
work_dir = './work_dirs/tsn_r50_1x1x3_100e_gf_rgb/'  
load_from = None  
resume_from = None  
workflow = [('train', 1)]

For more context: the dataset is named gf because those are the initials of my first and last name, and the training runs in Google Colab because my PC isn’t powerful enough.

🔴 What is the unexpected error?

To train the model I use the following command:

!CUDA_LAUNCH_BLOCKING=1 python tools/train.py /content/mmaction2/configs/recognition/tsn/tsn_r50_1x1x3_100e_gf_rgb.py --validate --gpus 8

When I first started training I wasn’t using the optional --validate argument, and the command above ran fine, but the model was not accurate at all. After reading the documentation more carefully I found that this option is highly recommended, since it validates the model periodically during training, and this is where the error occurs.

Whenever I run the command above, it prints the following:

/content/mmaction2/mmaction/utils/setup_env.py:33: UserWarning: Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  f'Setting OMP_NUM_THREADS environment variable for each process '
/content/mmaction2/mmaction/utils/setup_env.py:43: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  f'Setting MKL_NUM_THREADS environment variable for each process '
tools/train.py:113: UserWarning: The Args `gpu_ids` and `gpus` are only used in non-distributed mode and we highly encourage you to use distributed mode, i.e., launch training with dist_train.sh. The two args will be deperacted.
  'The Args `gpu_ids` and `gpus` are only used in non-distributed '
tools/train.py:123: UserWarning: Non-distributed training can only use 1 gpu now. 
  warnings.warn('Non-distributed training can only use 1 gpu now. ')
2022-10-10 14:25:23,058 - mmaction - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.7.14 (default, Sep  8 2022, 00:06:44) [GCC 7.5.0]
CUDA available: True
GPU 0: Tesla T4
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.2, V11.2.152
GCC: x86_64-linux-gnu-gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.9.0+cu111
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.1.2 (Git Hash 98be7e8afa711dc9b66c8ff3504129cb82013cdb)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.0.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

TorchVision: 0.10.0+cu111
OpenCV: 4.6.0
MMCV: 1.6.2
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.1
MMAction2: 0.24.1+3791db4
------------------------------------------------------------

2022-10-10 14:25:23,058 - mmaction - INFO - Distributed training: False
2022-10-10 14:25:23,692 - mmaction - INFO - Config: model = dict(
    type='Recognizer2D',
    backbone=dict(
        type='ResNet',
        pretrained='torchvision://resnet50',
        depth=50,
        norm_eval=False),
    cls_head=dict(
        type='TSNHead',
        num_classes=2,
        in_channels=2048,
        spatial_type='avg',
        consensus=dict(type='AvgConsensus', dim=1),
        dropout_ratio=0.4,
        init_std=0.01),
    train_cfg=None,
    test_cfg=dict(average_clips=None))
dataset_type = 'RawframeDataset'
data_root = '/content/drive/MyDrive/colab_master_thesis/test/'
data_root_val = '/content/drive/MyDrive/colab_master_thesis/test/'
ann_file_train = '/content/drive/MyDrive/colab_master_thesis/gf_train_list_rawframes.txt'
ann_file_val = '/content/drive/MyDrive/colab_master_thesis/gf_val_list_rawframes.txt'
ann_file_test = '/content/drive/MyDrive/colab_master_thesis/gf_val_list_rawframes.txt'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_bgr=False)
train_pipeline = [
    dict(type='SampleFrames', clip_len=1, frame_interval=1, num_clips=3),
    dict(type='RawFrameDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(
        type='MultiScaleCrop',
        input_size=224,
        scales=(1, 0.875, 0.75, 0.66),
        random_crop=False,
        max_wh_scale_gap=1),
    dict(type='Resize', scale=(224, 224), keep_ratio=False),
    dict(type='Flip', flip_ratio=0.5),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_bgr=False),
    dict(type='FormatShape', input_format='NCHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs', 'label'])
]
val_pipeline = [
    dict(
        type='SampleFrames',
        clip_len=1,
        frame_interval=1,
        num_clips=3,
        test_mode=True),
    dict(type='RawFrameDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(type='CenterCrop', crop_size=224),
    dict(type='Flip', flip_ratio=0),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_bgr=False),
    dict(type='FormatShape', input_format='NCHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs'])
]
test_pipeline = [
    dict(
        type='SampleFrames',
        clip_len=1,
        frame_interval=1,
        num_clips=25,
        test_mode=True),
    dict(type='RawFrameDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(type='TenCrop', crop_size=224),
    dict(type='Flip', flip_ratio=0),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_bgr=False),
    dict(type='FormatShape', input_format='NCHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs'])
]
data = dict(
    videos_per_gpu=32,
    workers_per_gpu=2,
    train_dataloader=dict(drop_last=True),
    val_dataloader=dict(videos_per_gpu=1),
    test_dataloader=dict(videos_per_gpu=2),
    train=dict(
        type='RawframeDataset',
        ann_file=
        '/content/drive/MyDrive/colab_master_thesis/gf_train_list_rawframes.txt',
        data_prefix='/content/drive/MyDrive/colab_master_thesis/test/',
        pipeline=[
            dict(
                type='SampleFrames', clip_len=1, frame_interval=1,
                num_clips=3),
            dict(type='RawFrameDecode'),
            dict(type='Resize', scale=(-1, 256)),
            dict(
                type='MultiScaleCrop',
                input_size=224,
                scales=(1, 0.875, 0.75, 0.66),
                random_crop=False,
                max_wh_scale_gap=1),
            dict(type='Resize', scale=(224, 224), keep_ratio=False),
            dict(type='Flip', flip_ratio=0.5),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_bgr=False),
            dict(type='FormatShape', input_format='NCHW'),
            dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
            dict(type='ToTensor', keys=['imgs', 'label'])
        ]),
    val=dict(
        type='RawframeDataset',
        ann_file=
        '/content/drive/MyDrive/colab_master_thesis/gf_val_list_rawframes.txt',
        data_prefix='/content/drive/MyDrive/colab_master_thesis/test/',
        pipeline=[
            dict(
                type='SampleFrames',
                clip_len=1,
                frame_interval=1,
                num_clips=3,
                test_mode=True),
            dict(type='RawFrameDecode'),
            dict(type='Resize', scale=(-1, 256)),
            dict(type='CenterCrop', crop_size=224),
            dict(type='Flip', flip_ratio=0),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_bgr=False),
            dict(type='FormatShape', input_format='NCHW'),
            dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
            dict(type='ToTensor', keys=['imgs'])
        ]),
    test=dict(
        type='RawframeDataset',
        ann_file=
        '/content/drive/MyDrive/colab_master_thesis/gf_val_list_rawframes.txt',
        data_prefix='/content/drive/MyDrive/colab_master_thesis/test/',
        pipeline=[
            dict(
                type='SampleFrames',
                clip_len=1,
                frame_interval=1,
                num_clips=25,
                test_mode=True),
            dict(type='RawFrameDecode'),
            dict(type='Resize', scale=(-1, 256)),
            dict(type='TenCrop', crop_size=224),
            dict(type='Flip', flip_ratio=0),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_bgr=False),
            dict(type='FormatShape', input_format='NCHW'),
            dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
            dict(type='ToTensor', keys=['imgs'])
        ]))
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=dict(max_norm=40, norm_type=2))
lr_config = dict(policy='step', step=[40, 80])
total_epochs = 100
checkpoint_config = dict(interval=5)
evaluation = dict(
    interval=5,
    metrics=['top_k_accuracy', 'mean_class_accuracy'],
    metric_options=dict(top_k_accuracy=dict(topk=(1, 3))),
    save_best='top_k_accuracy')
eval_config = dict(metric_options=dict(top_k_accuracy=dict(topk=(1, 3))))
log_config = dict(interval=20, hooks=[dict(type='TextLoggerHook')])
dist_params = dict(backend='nccl')
log_level = 'INFO'
work_dir = './work_dirs/tsn_r50_1x1x3_100e_gf_rgb/'
load_from = None
resume_from = None
workflow = [('train', 1)]
gpu_ids = range(0, 1)
omnisource = False
module_hooks = []

2022-10-10 14:25:23,696 - mmaction - INFO - Set random seed to 1601455427, deterministic: False
load checkpoint from torchvision path: torchvision://resnet50
Downloading: "https://download.pytorch.org/models/resnet50-0676ba61.pth" to /root/.cache/torch/hub/checkpoints/resnet50-0676ba61.pth
100% 97.8M/97.8M [00:03<00:00, 28.9MB/s]
2022-10-10 14:25:27,823 - mmaction - INFO - These parameters in pretrained checkpoint are not loaded: {'fc.bias', 'fc.weight'}
2022-10-10 14:25:36,372 - mmaction - INFO - Start running, host: root@cbd9493568c3, work_dir: /content/mmaction2/work_dirs/tsn_r50_1x1x3_100e_gf_rgb
2022-10-10 14:25:36,372 - mmaction - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH   ) StepLrUpdaterHook                  
(NORMAL      ) CheckpointHook                     
(LOW         ) EvalHook                           
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
before_train_epoch:
(VERY_HIGH   ) StepLrUpdaterHook                  
(LOW         ) IterTimerHook                      
(LOW         ) EvalHook                           
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
before_train_iter:
(VERY_HIGH   ) StepLrUpdaterHook                  
(LOW         ) IterTimerHook                      
(LOW         ) EvalHook                           
 -------------------- 
after_train_iter:
(ABOVE_NORMAL) OptimizerHook                      
(NORMAL      ) CheckpointHook                     
(LOW         ) IterTimerHook                      
(LOW         ) EvalHook                           
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
after_train_epoch:
(NORMAL      ) CheckpointHook                     
(LOW         ) EvalHook                           
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
before_val_epoch:
(LOW         ) IterTimerHook                      
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
before_val_iter:
(LOW         ) IterTimerHook                      
 -------------------- 
after_val_iter:
(LOW         ) IterTimerHook                      
 -------------------- 
after_val_epoch:
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
after_run:
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
2022-10-10 14:25:36,372 - mmaction - INFO - workflow: [('train', 1)], max: 100 epochs
2022-10-10 14:25:36,372 - mmaction - INFO - Checkpoints will be saved to /content/mmaction2/work_dirs/tsn_r50_1x1x3_100e_gf_rgb by HardDiskBackend.
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
2022-10-10 14:25:47,390 - mmaction - INFO - Saving checkpoint at 5 epochs
[                                                  ] 0/2, elapsed: 0s, ETA:[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
[>>] 2/2, 0.3 task/s, elapsed: 8s, ETA:     0s2022-10-10 14:25:55,715 - mmaction - INFO - Evaluating top_k_accuracy ...
2022-10-10 14:25:55,717 - mmaction - INFO - 
top1_acc	0.0000
top3_acc	0.5000
2022-10-10 14:25:55,717 - mmaction - INFO - Evaluating mean_class_accuracy ...
2022-10-10 14:25:55,718 - mmaction - INFO - 
mean_acc	0.0000
Traceback (most recent call last):
  File "tools/train.py", line 222, in <module>
    main()
  File "tools/train.py", line 218, in main
    meta=meta)
  File "/content/mmaction2/mmaction/apis/train.py", line 232, in train_model
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs, **runner_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/epoch_based_runner.py", line 136, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/epoch_based_runner.py", line 58, in train
    self.call_hook('after_train_epoch')
  File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/base_runner.py", line 317, in call_hook
    getattr(hook, fn_name)(self)
  File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/hooks/evaluation.py", line 271, in after_train_epoch
    self._do_evaluate(runner)
  File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/hooks/evaluation.py", line 277, in _do_evaluate
    key_score = self.evaluate(runner, results)
  File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/hooks/evaluation.py", line 388, in evaluate
    return eval_res[self.key_indicator]
KeyError: 'top_k_accuracy'

🔧 What I think might be the problem?

Beyond this point I don’t have much insight into where the problem is coming from. I know I am evaluating top_k_accuracy, so what does it mean that it can’t be found? It’s confusing to me because the config file already says that I want to save the best top_k_accuracy.

Where do you think this problem is coming from?
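One possible reading of the traceback, sketched under assumptions: the log shows that evaluation reported the keys top1_acc, top3_acc and mean_acc, while the last frame looks up `eval_res[self.key_indicator]` with the key `top_k_accuracy`, which appears to come from `save_best`. The plain-Python snippet below imitates that lookup. `eval_res` and `pick_key_score` are hypothetical stand-ins, not MMCV code, and the `'auto'` branch reflects how MMCV's EvalHook documentation describes that option as far as I know.

```python
from collections import OrderedDict

# Hypothetical stand-in for the dict returned by evaluation;
# the keys and values are the ones printed in the log above.
eval_res = OrderedDict(top1_acc=0.0, top3_acc=0.5, mean_acc=0.0)


def pick_key_score(results, key_indicator):
    """Imitate the final lookup in the traceback: results[key_indicator]."""
    if key_indicator == 'auto':
        # Assumption: 'auto' falls back to the first reported key.
        key_indicator = next(iter(results))
    return results[key_indicator]


try:
    # 'top_k_accuracy' is the name of a metric, not a key of the results.
    pick_key_score(eval_res, 'top_k_accuracy')
except KeyError as err:
    print('KeyError:', err)

# A key that was actually reported can be looked up without error.
print(pick_key_score(eval_res, 'top1_acc'))
```

If this reading is right, setting `save_best='top1_acc'` (or `save_best='auto'`) in the `evaluation` dict may avoid the KeyError; the thread itself does not state which change resolved the issue.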

🧠 Final thoughts

What I’ve done so far: I recorded some videos of myself performing the actions I would like to classify (keeping only the frames that show the action), took a pre-built config for a base model (e.g. TSN), fed my own dataset into it, and looked at the results.

Additional Questions:

Should I do any work on the dataset or on the model before the training phase?
I have recorded 6 videos of each action (50 to 110 frames per action); 5 videos are used for training and 1 for validation. Is this enough to build an accurate model?
Is there any config option I should be more mindful of, e.g., the resizing of the frames? Basically, all the parameters are copied from the example.
Is my approach correct? If not, what is the best approach to this problem? A simple, detailed step-by-step guide would be very helpful.
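On the config question, one option that stands out is topk=(1, 3) together with num_classes=2, which asks for a top-3 accuracy over only two class scores. The sketch below is a simplified re-implementation of top-k accuracy with made-up scores, not MMAction2's own function. It suggests that top-3 accuracy should always be 1.0 when every label is 0 or 1, so the top3_acc of 0.5 in the log might point to a validation label outside that range, for instance 1-based labels in the annotation file. This is an inference from the log, not something confirmed in the thread.

```python
def top_k_hit(scores, label, k):
    """Return True if `label` is among the k highest-scoring class indices."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return label in ranked[:k]


def top_k_accuracy(all_scores, labels, k):
    """Fraction of samples whose label is among the top-k predictions."""
    hits = [top_k_hit(s, y, k) for s, y in zip(all_scores, labels)]
    return sum(hits) / len(hits)


# Made-up scores for two validation clips of a 2-class model.
scores = [[0.7, 0.3], [0.9, 0.1]]

# Labels inside {0, 1}: top-3 covers both classes, so every sample is a hit.
print(top_k_accuracy(scores, [0, 1], k=3))  # 1.0

# Hypothetical 1-based labels: label 2 can never be predicted.
print(top_k_accuracy(scores, [1, 2], k=1))  # 0.0
print(top_k_accuracy(scores, [1, 2], k=3))  # 0.5
```

With only two classes, topk=(1,) is probably the more meaningful setting, and checking that the labels in the annotation files start at 0 seems worthwhile.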

🙏 Thank you in advance for your attention. If you need more detailed information, just let me know.

Top GitHub Comments

hukkai commented, Oct 18, 2022

Glad it worked. If you need any further discussion, feel free to re-open the issue.

goncalofurtado1 commented, Oct 17, 2022

@hukkai Yes, it worked! It took me some time to get used to this toolbox, but in the end it was simpler than I thought. Even with a small dataset, I just wanted to see whether I would get the result I was expecting. It predicts the two classes of the dataset fairly accurately. Now maybe I’ll build a more concise dataset and use the model in the remaining part of my project.

Thank you very much for your help!