ReID training
Checklist
- I have searched related issues but cannot get the expected help.
- The bug has not been fixed in the latest version.
Describe the bug
ReID training on MOT17 crashes with a CUDA device-side assert (cur_target >= 0 && cur_target < n_classes) in the cross-entropy loss. The num_classes value in the default config does not appear to match the identity labels in the generated annotation file.
Reproduction
- What command or script did you run?
python3 ./tools/train.py configs/reid/resnet50_b32x8_MOT17.py --work-dir work_dirs/resnet50_b32x8_MOT17
- I did not make any modifications to the code except the dataset path.
- I'm running ReID training on the MOT17 dataset.
Environment
- Output of python mmtrack/utils/collect_env.py: see the environment block at the top of the error log below (sys.platform: linux, Python 3.8.11, PyTorch 1.7.1+cu101, CUDA Runtime 10.1, TorchVision 0.8.2+cu101, MMCV 1.3.11, MMTracking 0.6.0+4d78b77, GPU 0: TITAN Xp).
Error traceback
sys.platform: linux
Python: 3.8.11 (default, Jul 3 2021, 17:53:42) [GCC 7.5.0]
CUDA available: True
GPU 0: TITAN Xp
CUDA_HOME: None
GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
PyTorch: 1.7.1+cu101
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 10.1
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75
- CuDNN 7.6.3
- Magma 2.5.2
- Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
TorchVision: 0.8.2+cu101
OpenCV: 4.5.3
MMCV: 1.3.11
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.1
MMTracking: 0.6.0+4d78b77
------------------------------------------------------------
2021-08-17 11:24:25,348 - mmtrack - INFO - Distributed training: False
2021-08-17 11:24:26,303 - mmtrack - INFO - Config:
dataset_type = 'ReIDDataset'
img_norm_cfg = dict(
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
dict(type='LoadMultiImagesFromFile', to_float32=True),
dict(
type='SeqResize',
img_scale=(128, 256),
share_params=False,
keep_ratio=False,
bbox_clip_border=False,
override=False),
dict(
type='SeqRandomFlip',
share_params=False,
flip_ratio=0.5,
direction='horizontal'),
dict(
type='SeqNormalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='VideoCollect', keys=['img', 'gt_label']),
dict(type='ReIDFormatBundle')
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='Resize', img_scale=(128, 256), keep_ratio=False),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'], meta_keys=[])
]
data_root = '/projects/datasets/MOT/MOT17/'
data = dict(
samples_per_gpu=2,
workers_per_gpu=2,
train=dict(
type='ReIDDataset',
triplet_sampler=dict(num_ids=8, ins_per_id=4),
data_prefix='/projects/datasets/MOT/MOT17/reid/imgs',
ann_file='/projects/datasets/MOT/MOT17/reid/meta/train_80.txt',
pipeline=[
dict(type='LoadMultiImagesFromFile', to_float32=True),
dict(
type='SeqResize',
img_scale=(128, 256),
share_params=False,
keep_ratio=False,
bbox_clip_border=False,
override=False),
dict(
type='SeqRandomFlip',
share_params=False,
flip_ratio=0.5,
direction='horizontal'),
dict(
type='SeqNormalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='VideoCollect', keys=['img', 'gt_label']),
dict(type='ReIDFormatBundle')
]),
val=dict(
type='ReIDDataset',
triplet_sampler=None,
data_prefix='/projects/datasets/MOT/MOT17/reid/imgs',
ann_file='/projects/datasets/MOT/MOT17/reid/meta/val_20.txt',
pipeline=[
dict(type='LoadImageFromFile'),
dict(type='Resize', img_scale=(128, 256), keep_ratio=False),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'], meta_keys=[])
]),
test=dict(
type='ReIDDataset',
triplet_sampler=None,
data_prefix='/projects/datasets/MOT/MOT17/reid/imgs',
ann_file='/projects/datasets/MOT/MOT17/reid/meta/val_20.txt',
pipeline=[
dict(type='LoadImageFromFile'),
dict(type='Resize', img_scale=(128, 256), keep_ratio=False),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'], meta_keys=[])
]))
evaluation = dict(interval=1, metric='mAP')
optimizer = dict(type='SGD', lr=0.0025, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
checkpoint_config = dict(interval=1)
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
USE_MMCLS = True
model = dict(
type='BaseReID',
backbone=dict(
type='ResNet',
depth=50,
num_stages=4,
out_indices=(3, ),
style='pytorch'),
neck=dict(type='GlobalAveragePooling', kernel_size=(8, 4), stride=1),
head=dict(
type='LinearReIDHead',
num_fcs=1,
in_channels=2048,
fc_channels=1024,
out_channels=128,
num_classes=378,
loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
loss_pairwise=dict(type='TripletLoss', margin=0.3, loss_weight=1.0),
norm_cfg=dict(type='BN1d'),
act_cfg=dict(type='ReLU')),
init_cfg=dict(
type='Pretrained',
checkpoint=
'https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_batch256_imagenet_20200708-cfb998bf.pth'
))
lr_config = dict(
policy='step',
warmup='linear',
warmup_iters=1000,
warmup_ratio=0.001,
step=[5])
total_epochs = 6
work_dir = 'work_dirs/resnet50_b32x8_MOT17'
gpu_ids = range(0, 1)
2021-08-17 11:24:26,638 - mmtrack - INFO - initialize BaseReID with init_cfg {'type': 'Pretrained', 'checkpoint': 'https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_batch256_imagenet_20200708-cfb998bf.pth'}
2021-08-17 11:24:26,638 - mmcv - INFO - load model from: https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_batch256_imagenet_20200708-cfb998bf.pth
2021-08-17 11:24:26,638 - mmcv - INFO - Use load_from_http loader
2021-08-17 11:24:26,844 - mmcv - WARNING - The model and loaded state dict do not match exactly
unexpected key in source state_dict: head.fc.weight, head.fc.bias
missing keys in source state_dict: head.fcs.0.fc.weight, head.fcs.0.fc.bias, head.fcs.0.bn.weight, head.fcs.0.bn.bias, head.fcs.0.bn.running_mean, head.fcs.0.bn.running_var, head.fc_out.weight, head.fc_out.bias, head.bn.weight, head.bn.bias, head.bn.running_mean, head.bn.running_var, head.classifier.weight, head.classifier.bias
2021-08-17 11:24:33,803 - mmtrack - INFO - Start running, host: qljx17@gpu3, work_dir: /home2/qljx17/Open-MMLab/mmtracking/work_dirs/resnet50_b32x8_MOT17
2021-08-17 11:24:33,803 - mmtrack - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH ) StepLrUpdaterHook
(NORMAL ) CheckpointHook
(NORMAL ) EvalHook
(VERY_LOW ) TextLoggerHook
--------------------
before_train_epoch:
(VERY_HIGH ) StepLrUpdaterHook
(NORMAL ) EvalHook
(LOW ) IterTimerHook
(VERY_LOW ) TextLoggerHook
--------------------
before_train_iter:
(VERY_HIGH ) StepLrUpdaterHook
(NORMAL ) EvalHook
(LOW ) IterTimerHook
--------------------
after_train_iter:
(ABOVE_NORMAL) OptimizerHook
(NORMAL ) CheckpointHook
(NORMAL ) EvalHook
(LOW ) IterTimerHook
(VERY_LOW ) TextLoggerHook
--------------------
after_train_epoch:
(NORMAL ) CheckpointHook
(NORMAL ) EvalHook
(VERY_LOW ) TextLoggerHook
--------------------
before_val_epoch:
(LOW ) IterTimerHook
(VERY_LOW ) TextLoggerHook
--------------------
before_val_iter:
(LOW ) IterTimerHook
--------------------
after_val_iter:
(LOW ) IterTimerHook
--------------------
after_val_epoch:
(VERY_LOW ) TextLoggerHook
--------------------
2021-08-17 11:24:33,803 - mmtrack - INFO - workflow: [('train', 1)], max: 6 epochs
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:59: ClassNLLCriterion_updateOutput_no_reduce_kernel: block: [0,0,0], thread: [44,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:59: ClassNLLCriterion_updateOutput_no_reduce_kernel: block: [0,0,0], thread: [45,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:59: ClassNLLCriterion_updateOutput_no_reduce_kernel: block: [0,0,0], thread: [46,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:59: ClassNLLCriterion_updateOutput_no_reduce_kernel: block: [0,0,0], thread: [47,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
Traceback (most recent call last):
File "./tools/train.py", line 174, in <module>
main()
File "./tools/train.py", line 163, in main
train_model(
File "/home2/qljx17/Open-MMLab/mmtracking/mmtrack/apis/train.py", line 136, in train_model
runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
File "/home2/qljx17/Open-MMLab/evenv/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home2/qljx17/Open-MMLab/evenv/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
self.run_iter(data_batch, train_mode=True, **kwargs)
File "/home2/qljx17/Open-MMLab/evenv/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 29, in run_iter
outputs = self.model.train_step(data_batch, self.optimizer,
File "/home2/qljx17/Open-MMLab/evenv/lib/python3.8/site-packages/mmcv/parallel/data_parallel.py", line 67, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "/home2/qljx17/Open-MMLab/mmclassification/mmcls/models/classifiers/base.py", line 146, in train_step
loss, log_vars = self._parse_losses(losses)
File "/home2/qljx17/Open-MMLab/mmclassification/mmcls/models/classifiers/base.py", line 97, in _parse_losses
log_vars[loss_name] = loss_value.mean()
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fc1479138b2 in /home2/qljx17/Open-MMLab/evenv/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7fc147b65952 in /home2/qljx17/Open-MMLab/evenv/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7fc1478feb7d in /home2/qljx17/Open-MMLab/evenv/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x5fd7a2 (0x7fc1920fb7a2 in /home2/qljx17/Open-MMLab/evenv/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x5fd856 (0x7fc1920fb856 in /home2/qljx17/Open-MMLab/evenv/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: python3() [0x534ce6]
frame #6: python3() [0x51c5d9]
frame #7: python3() [0x52cb15]
frame #8: python3() [0x52cb15]
frame #9: python3() [0x500a2e]
frame #10: python3() [0x57d905]
frame #11: python3() [0x57d8bb]
frame #12: python3() [0x57d8bb]
frame #13: python3() [0x57d8bb]
frame #14: python3() [0x57d8bb]
frame #15: python3() [0x57d8bb]
frame #16: python3() [0x57d8bb]
frame #17: python3() [0x5f25e6]
<omitting python frames>
frame #23: __libc_start_main + 0xf3 (0x7fc1a2ef10b3 in /lib/x86_64-linux-gnu/libc.so.6)
/var/spool/slurmd/job128755/slurm_script: line 21: 3941330 Aborted (core dumped) python3 ./tools/train.py configs/reid/resnet50_b32x8_MOT17.py --work-dir work_dirs/resnet50_b32x8_MOT17
Bug fix
From the error above, I assume the cause is the number of classes. In the default config, num_classes is set to 378, which is taken from train_80.txt, hence the error appears. However, when I set num_classes to 512, which is the number of identity folders under imgs, training runs without any error. Is there something I missed, or could the number of classes be the main problem here?
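One way to sanity-check this is to count the labels actually present in the annotation file (a minimal sketch, assuming each line of train_80.txt has the form "<relative/img/path> <label>", the format written by tools/convert_datasets/mot2reid.py; the path below is my dataset path):

# Count the distinct identity labels in the annotation file.
# CrossEntropyLoss requires 0 <= label < num_classes, so
# head.num_classes must be at least max(label) + 1.
labels = set()
with open('/projects/datasets/MOT/MOT17/reid/meta/train_80.txt') as f:
    for line in f:
        parts = line.split()
        if len(parts) == 2:
            labels.add(int(parts[1]))
print('distinct ids:', len(labels))
print('required num_classes >=', max(labels) + 1)

If the second number is larger than the configured 378, the device-side assert in the loss is expected.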
Hi @yonafalinie, it's a bug introduced by tools/convert_datasets/mot2reid.py. The script may generate different train_80.txt and val_20.txt files on different machines. We have fixed it in #249.

Hi, you're most welcome, although I am not sure if it's really a bug or whether my slight modification caused the error.