Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.

Describe the bug A clear and concise description of what the bug is.

Reproduction

  1. What command or script did you run?
python3 ./tools/train.py configs/reid/resnet50_b32x8_MOT17.py --work-dir work_dirs/resnet50_b32x8_MOT17
  2. I did not modify the code, apart from the dataset path.
  3. I'm running ReID training on the MOT dataset.

Environment

  1. Please run python mmtrack/utils/collect_env.py to collect necessary environment information and paste it here.

sys.platform: linux
Python: 3.8.11 (default, Jul 3 2021, 17:53:42) [GCC 7.5.0]
CUDA available: True
GPU 0: TITAN Xp
CUDA_HOME: None
GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
PyTorch: 1.7.1+cu101
PyTorch compiling details: PyTorch built with:
  • GCC 7.3
  • C++ Version: 201402
  • Intel® Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel® 64 architecture applications
  • Intel® MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 10.1
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75
  • CuDNN 7.6.3
  • Magma 2.5.2
  • Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.8.2+cu101
OpenCV: 4.5.3
MMCV: 1.3.11
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.1
MMTracking: 0.6.0+4d78b77

  2. You may add additional information that may be helpful for locating the problem, such as
    • How you installed PyTorch [e.g., pip, conda, source]
    • Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)

Error traceback If applicable, paste the error traceback here.

sys.platform: linux
Python: 3.8.11 (default, Jul  3 2021, 17:53:42) [GCC 7.5.0]
CUDA available: True
GPU 0: TITAN Xp
CUDA_HOME: None
GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
PyTorch: 1.7.1+cu101
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 10.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75
  - CuDNN 7.6.3
  - Magma 2.5.2
  - Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

TorchVision: 0.8.2+cu101
OpenCV: 4.5.3
MMCV: 1.3.11
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.1
MMTracking: 0.6.0+4d78b77
------------------------------------------------------------

2021-08-17 11:24:25,348 - mmtrack - INFO - Distributed training: False
2021-08-17 11:24:26,303 - mmtrack - INFO - Config:
dataset_type = 'ReIDDataset'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadMultiImagesFromFile', to_float32=True),
    dict(
        type='SeqResize',
        img_scale=(128, 256),
        share_params=False,
        keep_ratio=False,
        bbox_clip_border=False,
        override=False),
    dict(
        type='SeqRandomFlip',
        share_params=False,
        flip_ratio=0.5,
        direction='horizontal'),
    dict(
        type='SeqNormalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_rgb=True),
    dict(type='VideoCollect', keys=['img', 'gt_label']),
    dict(type='ReIDFormatBundle')
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='Resize', img_scale=(128, 256), keep_ratio=False),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_rgb=True),
    dict(type='ImageToTensor', keys=['img']),
    dict(type='Collect', keys=['img'], meta_keys=[])
]
data_root = '/projects/datasets/MOT/MOT17/'
data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        type='ReIDDataset',
        triplet_sampler=dict(num_ids=8, ins_per_id=4),
        data_prefix='/projects/datasets/MOT/MOT17/reid/imgs',
        ann_file='/projects/datasets/MOT/MOT17/reid/meta/train_80.txt',
        pipeline=[
            dict(type='LoadMultiImagesFromFile', to_float32=True),
            dict(
                type='SeqResize',
                img_scale=(128, 256),
                share_params=False,
                keep_ratio=False,
                bbox_clip_border=False,
                override=False),
            dict(
                type='SeqRandomFlip',
                share_params=False,
                flip_ratio=0.5,
                direction='horizontal'),
            dict(
                type='SeqNormalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='VideoCollect', keys=['img', 'gt_label']),
            dict(type='ReIDFormatBundle')
        ]),
    val=dict(
        type='ReIDDataset',
        triplet_sampler=None,
        data_prefix='/projects/datasets/MOT/MOT17/reid/imgs',
        ann_file='/projects/datasets/MOT/MOT17/reid/meta/val_20.txt',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='Resize', img_scale=(128, 256), keep_ratio=False),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'], meta_keys=[])
        ]),
    test=dict(
        type='ReIDDataset',
        triplet_sampler=None,
        data_prefix='/projects/datasets/MOT/MOT17/reid/imgs',
        ann_file='/projects/datasets/MOT/MOT17/reid/meta/val_20.txt',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='Resize', img_scale=(128, 256), keep_ratio=False),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'], meta_keys=[])
        ]))
evaluation = dict(interval=1, metric='mAP')
optimizer = dict(type='SGD', lr=0.0025, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
checkpoint_config = dict(interval=1)
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
USE_MMCLS = True
model = dict(
    type='BaseReID',
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(3, ),
        style='pytorch'),
    neck=dict(type='GlobalAveragePooling', kernel_size=(8, 4), stride=1),
    head=dict(
        type='LinearReIDHead',
        num_fcs=1,
        in_channels=2048,
        fc_channels=1024,
        out_channels=128,
        num_classes=378,
        loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
        loss_pairwise=dict(type='TripletLoss', margin=0.3, loss_weight=1.0),
        norm_cfg=dict(type='BN1d'),
        act_cfg=dict(type='ReLU')),
    init_cfg=dict(
        type='Pretrained',
        checkpoint=
        'https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_batch256_imagenet_20200708-cfb998bf.pth'
    ))
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=1000,
    warmup_ratio=0.001,
    step=[5])
total_epochs = 6
work_dir = 'work_dirs/resnet50_b32x8_MOT17'
gpu_ids = range(0, 1)

2021-08-17 11:24:26,638 - mmtrack - INFO - initialize BaseReID with init_cfg {'type': 'Pretrained', 'checkpoint': 'https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_batch256_imagenet_20200708-cfb998bf.pth'}
2021-08-17 11:24:26,638 - mmcv - INFO - load model from: https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_batch256_imagenet_20200708-cfb998bf.pth
2021-08-17 11:24:26,638 - mmcv - INFO - Use load_from_http loader
2021-08-17 11:24:26,844 - mmcv - WARNING - The model and loaded state dict do not match exactly

unexpected key in source state_dict: head.fc.weight, head.fc.bias

missing keys in source state_dict: head.fcs.0.fc.weight, head.fcs.0.fc.bias, head.fcs.0.bn.weight, head.fcs.0.bn.bias, head.fcs.0.bn.running_mean, head.fcs.0.bn.running_var, head.fc_out.weight, head.fc_out.bias, head.bn.weight, head.bn.bias, head.bn.running_mean, head.bn.running_var, head.classifier.weight, head.classifier.bias

2021-08-17 11:24:33,803 - mmtrack - INFO - Start running, host: qljx17@gpu3, work_dir: /home2/qljx17/Open-MMLab/mmtracking/work_dirs/resnet50_b32x8_MOT17
2021-08-17 11:24:33,803 - mmtrack - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH   ) StepLrUpdaterHook                  
(NORMAL      ) CheckpointHook                     
(NORMAL      ) EvalHook                           
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
before_train_epoch:
(VERY_HIGH   ) StepLrUpdaterHook                  
(NORMAL      ) EvalHook                           
(LOW         ) IterTimerHook                      
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
before_train_iter:
(VERY_HIGH   ) StepLrUpdaterHook                  
(NORMAL      ) EvalHook                           
(LOW         ) IterTimerHook                      
 -------------------- 
after_train_iter:
(ABOVE_NORMAL) OptimizerHook                      
(NORMAL      ) CheckpointHook                     
(NORMAL      ) EvalHook                           
(LOW         ) IterTimerHook                      
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
after_train_epoch:
(NORMAL      ) CheckpointHook                     
(NORMAL      ) EvalHook                           
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
before_val_epoch:
(LOW         ) IterTimerHook                      
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
before_val_iter:
(LOW         ) IterTimerHook                      
 -------------------- 
after_val_iter:
(LOW         ) IterTimerHook                      
 -------------------- 
after_val_epoch:
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
2021-08-17 11:24:33,803 - mmtrack - INFO - workflow: [('train', 1)], max: 6 epochs
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:59: ClassNLLCriterion_updateOutput_no_reduce_kernel: block: [0,0,0], thread: [44,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:59: ClassNLLCriterion_updateOutput_no_reduce_kernel: block: [0,0,0], thread: [45,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:59: ClassNLLCriterion_updateOutput_no_reduce_kernel: block: [0,0,0], thread: [46,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:59: ClassNLLCriterion_updateOutput_no_reduce_kernel: block: [0,0,0], thread: [47,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
Traceback (most recent call last):
  File "./tools/train.py", line 174, in <module>
    main()
  File "./tools/train.py", line 163, in main
    train_model(
  File "/home2/qljx17/Open-MMLab/mmtracking/mmtrack/apis/train.py", line 136, in train_model
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home2/qljx17/Open-MMLab/evenv/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home2/qljx17/Open-MMLab/evenv/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/home2/qljx17/Open-MMLab/evenv/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 29, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
  File "/home2/qljx17/Open-MMLab/evenv/lib/python3.8/site-packages/mmcv/parallel/data_parallel.py", line 67, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/home2/qljx17/Open-MMLab/mmclassification/mmcls/models/classifiers/base.py", line 146, in train_step
    loss, log_vars = self._parse_losses(losses)
  File "/home2/qljx17/Open-MMLab/mmclassification/mmcls/models/classifiers/base.py", line 97, in _parse_losses
    log_vars[loss_name] = loss_value.mean()
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fc1479138b2 in /home2/qljx17/Open-MMLab/evenv/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7fc147b65952 in /home2/qljx17/Open-MMLab/evenv/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7fc1478feb7d in /home2/qljx17/Open-MMLab/evenv/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x5fd7a2 (0x7fc1920fb7a2 in /home2/qljx17/Open-MMLab/evenv/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x5fd856 (0x7fc1920fb856 in /home2/qljx17/Open-MMLab/evenv/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: python3() [0x534ce6]
frame #6: python3() [0x51c5d9]
frame #7: python3() [0x52cb15]
frame #8: python3() [0x52cb15]
frame #9: python3() [0x500a2e]
frame #10: python3() [0x57d905]
frame #11: python3() [0x57d8bb]
frame #12: python3() [0x57d8bb]
frame #13: python3() [0x57d8bb]
frame #14: python3() [0x57d8bb]
frame #15: python3() [0x57d8bb]
frame #16: python3() [0x57d8bb]
frame #17: python3() [0x5f25e6]
<omitting python frames>
frame #23: __libc_start_main + 0xf3 (0x7fc1a2ef10b3 in /lib/x86_64-linux-gnu/libc.so.6)

/var/spool/slurmd/job128755/slurm_script: line 21: 3941330 Aborted                 (core dumped) python3 ./tools/train.py configs/reid/resnet50_b32x8_MOT17.py --work-dir work_dirs/resnet50_b32x8_MOT17

Bug fix From the error above, I assume the cause is the number of classes. The default config sets num_classes to 378, which is taken from train_80.txt, and with that value the error appears. However, when I set num_classes to 512, which is the number of samples in the imgs folder, training runs without any error. Is there something I missed, or is the number of classes the main problem here?
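The assertion in the log, `cur_target >= 0 && cur_target < n_classes`, fires when a label falls outside the range the classifier head was built for. A minimal sketch of how one might check this before training is shown here. It assumes each line of the annotation file is an image path followed by an integer label as the last whitespace-separated field; that format, the function names, and the sample lines are illustrative assumptions, not taken from the MMTracking code.

```python
from typing import Iterable, List, Tuple


def parse_labels(lines: Iterable[str]) -> List[int]:
    """Extract the integer label from each non-empty annotation line."""
    labels = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumed format: the label is the last whitespace-separated field.
        labels.append(int(line.rsplit(maxsplit=1)[-1]))
    return labels


def check_label_range(labels: Iterable[int],
                      num_classes: int) -> Tuple[int, int, List[int]]:
    """Return (min label, max label, sorted distinct out-of-range labels)."""
    labels = list(labels)
    if not labels:
        raise ValueError("no labels found")
    # Same condition as the CUDA assertion: 0 <= label < num_classes.
    bad = sorted({lab for lab in labels if lab < 0 or lab >= num_classes})
    return min(labels), max(labels), bad


# Hypothetical annotation lines, for illustration only.
sample_lines = [
    "MOT17-02-FRCNN_000002/000000.jpg 0",
    "MOT17-05-FRCNN_000010/000003.jpg 377",
    "MOT17-09-FRCNN_000021/000001.jpg 410",
]

lo, hi, bad = check_label_range(parse_labels(sample_lines), num_classes=378)
print(f"min={lo} max={hi} out_of_range={bad}")
```

To run the check on a real file, pass an open file object to `parse_labels`, for example `parse_labels(open(ann_file))`. If the largest label is at or above num_classes, raising num_classes hides the assertion, but the labels in the file may still not be what the config expects.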

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

1 reaction
GT9505 commented, Sep 2, 2021

Hi @yonafalinie, it’s a bug introduced by tools/convert_datasets/mot2reid.py. The script may generate different train_80.txt and val_20.txt files on different machines. We have fixed it in #249.
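The comment does not say why the generated files differ between machines, and the following is not the change made in #249. As a hedged sketch only: assuming the difference comes from an order-dependent step (for example an unsorted directory listing, which is an assumption here), a split can be made reproducible by sorting the identities first and shuffling with a local seeded generator. The names `split_ids` and `relabel` and the sample identities are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple


def split_ids(ids: Sequence[str],
              train_ratio: float = 0.8,
              seed: int = 0) -> Tuple[List[str], List[str]]:
    """Split identities into train/val independently of the input order."""
    ordered = sorted(ids)          # remove any dependence on listing order
    rng = random.Random(seed)      # local generator, no global state
    rng.shuffle(ordered)
    cut = int(len(ordered) * train_ratio)
    return ordered[:cut], ordered[cut:]


def relabel(train_ids: Sequence[str]) -> Dict[str, int]:
    """Map training identities to contiguous labels 0..len(train_ids)-1."""
    return {pid: idx for idx, pid in enumerate(sorted(train_ids))}


# Hypothetical identity names, listed in two different orders.
ids_machine_a = [f"person_{i:03d}" for i in range(10)]
ids_machine_b = list(reversed(ids_machine_a))

train_a, val_a = split_ids(ids_machine_a)
train_b, val_b = split_ids(ids_machine_b)
labels = relabel(train_a)
print(len(train_a), len(val_a), max(labels.values()))
```

With contiguous labels, the largest label is always one less than the number of training identities, so a num_classes value equal to that count satisfies the range check in the loss.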

0 reactions
yonafalinie commented, Aug 23, 2021

Hi, most welcome, although I am not sure whether it’s really a bug or whether my slight modification caused the error.
