train ssd300 on VOC

python train.py --work_dir '/home/hs/hs/237014845/HuaWei/mmdetection-master/weights' --seed 100 '/home/hs/hs/237014845/HuaWei/mmdetection-master/configs/pascal_voc/ssd300_voc.py'

2019-03-13 09:43:47,761 - INFO - Distributed training: False
2019-03-13 09:43:47,761 - INFO - Set random seed to 100
2019-03-13 09:43:48,000 - INFO - load model from: open-mmlab://vgg16_caffe
2019-03-13 09:43:48,050 - WARNING - missing keys in source state_dict: extra.4.weight, extra.7.weight, extra.1.bias, extra.1.weight, l2_norm.weight, extra.2.bias, extra.7.bias, extra.4.bias, extra.0.bias, extra.3.bias, extra.0.weight, extra.5.bias, extra.2.weight, extra.6.weight, extra.3.weight, extra.5.weight, extra.6.bias
2019-03-13 09:43:50,310 - INFO - Start running, host: hs@hs-System-Product-Name, work_dir: /home/hs/hs/237014845/HuaWei/mmdetection-master/weights
2019-03-13 09:43:50,311 - INFO - workflow: [('train', 1)], max: 24 epochs
2019-03-13 09:44:16,016 - INFO - Epoch [1][50/41378] lr: 0.00100, eta: 5 days, 21:48:20, time: 0.514, data_time: 0.006, loss_cls: 19.5927, loss_reg: 3.8320, loss: 23.4247
/opt/conda/conda-bld/pytorch_1549628766161/work/aten/src/THC/THCTensorScatterGather.cu:124: void THCudaTensor_scatterKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 1]: block: [0,0,0], thread: [0,0,0] Assertion `indexValue >= 0 && indexValue < tensor.sizes[dim]` failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1549628766161/work/aten/src/THC/generated/…/THCReduceAll.cuh line=317 error=59 : device-side assert triggered
Traceback (most recent call last):
File "train.py", line 90, in <module>
main()
File "train.py", line 86, in main
logger=logger)
File "/home/hs/anaconda3/lib/python3.6/site-packages/mmdet-0.6rc0+unknown-py3.6.egg/mmdet/apis/train.py", line 59, in train_detector
_non_dist_train(model, dataset, cfg, validate=validate)
File "/home/hs/anaconda3/lib/python3.6/site-packages/mmdet-0.6rc0+unknown-py3.6.egg/mmdet/apis/train.py", line 121, in _non_dist_train
runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
File "/home/hs/anaconda3/lib/python3.6/site-packages/mmcv/runner/runner.py", line 355, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home/hs/anaconda3/lib/python3.6/site-packages/mmcv/runner/runner.py", line 268, in train
self.call_hook('after_train_iter')
File "/home/hs/anaconda3/lib/python3.6/site-packages/mmcv/runner/runner.py", line 228, in call_hook
getattr(hook, fn_name)(self)
File "/home/hs/anaconda3/lib/python3.6/site-packages/mmcv/runner/hooks/optimizer.py", line 17, in after_train_iter
runner.outputs['loss'].backward()
File "/home/hs/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 102, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/hs/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1549628766161/work/aten/src/THC/generated/…/THCReduceAll.cuh:317
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered (insert_events at /opt/conda/conda-bld/pytorch_1549628766161/work/aten/src/THC/THCCachingAllocator.cpp:470)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7fb752f6ccf5 in /home/hs/anaconda3/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x122a0d0 (0x7fb75723f0d0 in /home/hs/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #2: at::TensorImpl::release_resources() + 0x50 (0x7fb7536d8c30 in /home/hs/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #3: <unknown function> + 0x2a836b (0x7fb750cea36b in /home/hs/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #4: <unknown function> + 0x30eff0 (0x7fb750d50ff0 in /home/hs/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #5: torch::autograd::deleteFunction(torch::autograd::Function) + 0x2f0 (0x7fb750cecd70 in /home/hs/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #6: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x45 (0x7fb7741887f5 in /home/hs/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #7: torch::autograd::Variable::Impl::release_resources() + 0x4a (0x7fb750f5f1ba in /home/hs/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #8: <unknown function> + 0x12148b (0x7fb7741a048b in /home/hs/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x31a49f (0x7fb77439949f in /home/hs/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x31a4e1 (0x7fb7743994e1 in /home/hs/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #26: __libc_start_main + 0xf0 (0x7fb78ff0f830 in /lib/x86_64-linux-gnu/libc.so.6)
Aborted (core dumped)
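The failed assertion `indexValue >= 0 && indexValue < tensor.sizes[dim]` in the scatter kernel usually means a label index fed to the loss is outside the expected class range (for VOC, 20 classes plus background). Rerunning with `CUDA_LAUNCH_BLOCKING=1` makes kernel launches synchronous, so the traceback points at the real call site rather than a later `backward()` call. One quick way to rule out broken annotations is to scan the VOC XML files for object names outside the 20-class list; the sketch below is an assumption-laden helper (the `VOC_ROOT` path and the standard VOC2007/2012 layout are placeholders), not part of mmdetection:

```python
# Hypothetical sanity check: scan Pascal VOC annotations for unexpected class
# names, which would later turn into out-of-range label indices in the loss.
# Adjust VOC_ROOT to your own dataset location (assumed standard VOC layout).
import glob
import os
import xml.etree.ElementTree as ET

VOC_CLASSES = {
    'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat',
    'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person',
    'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor'}

VOC_ROOT = '/path/to/VOCdevkit/VOC2007'  # placeholder path

bad = 0
for xml_path in glob.glob(os.path.join(VOC_ROOT, 'Annotations', '*.xml')):
    root = ET.parse(xml_path).getroot()
    for obj in root.findall('object'):
        name = obj.find('name').text.strip()
        if name not in VOC_CLASSES:
            bad += 1
            print(f'{os.path.basename(xml_path)}: unexpected class "{name}"')
print(f'done, {bad} unexpected labels found')
```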
Top GitHub Comments
I decreased lr, but it does not work.
You should decrease lr if you train the model on a single card.
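For reference, a minimal sketch of what decreasing lr could look like in the config, assuming ssd300_voc.py uses the standard optimizer block (the stock settings are tuned for 8-GPU training, so the linear scaling rule suggests dividing the learning rate by roughly the factor your total batch size shrank by); the values here are illustrative, not an official single-GPU recipe:

```python
# ssd300_voc.py (excerpt) -- illustrative single-GPU adjustment, not the
# official recipe. The stock config assumes 8 GPUs; with one card, scale the
# learning rate down roughly linearly with the total batch size.
optimizer = dict(type='SGD', lr=1e-3 / 8, momentum=0.9, weight_decay=5e-4)
# Gradient clipping via optimizer_config can also help when the loss spikes
# early in training (mmcv's OptimizerHook supports grad_clip).
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
```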