train ssd300 on VOC

python train.py --work_dir '/home/hs/hs/237014845/HuaWei/mmdetection-master/weights' --seed 100 '/home/hs/hs/237014845/HuaWei/mmdetection-master/configs/pascal_voc/ssd300_voc.py'
2019-03-13 09:43:47,761 - INFO - Distributed training: False
2019-03-13 09:43:47,761 - INFO - Set random seed to 100
2019-03-13 09:43:48,000 - INFO - load model from: open-mmlab://vgg16_caffe
2019-03-13 09:43:48,050 - WARNING - missing keys in source state_dict: extra.4.weight, extra.7.weight, extra.1.bias, extra.1.weight, l2_norm.weight, extra.2.bias, extra.7.bias, extra.4.bias, extra.0.bias, extra.3.bias, extra.0.weight, extra.5.bias, extra.2.weight, extra.6.weight, extra.3.weight, extra.5.weight, extra.6.bias

2019-03-13 09:43:50,310 - INFO - Start running, host: hs@hs-System-Product-Name, work_dir: /home/hs/hs/237014845/HuaWei/mmdetection-master/weights
2019-03-13 09:43:50,311 - INFO - workflow: [('train', 1)], max: 24 epochs
2019-03-13 09:44:16,016 - INFO - Epoch [1][50/41378] lr: 0.00100, eta: 5 days, 21:48:20, time: 0.514, data_time: 0.006, loss_cls: 19.5927, loss_reg: 3.8320, loss: 23.4247
/opt/conda/conda-bld/pytorch_1549628766161/work/aten/src/THC/THCTensorScatterGather.cu:124: void THCudaTensor_scatterKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 1]: block: [0,0,0], thread: [0,0,0] Assertion indexValue >= 0 && indexValue < tensor.sizes[dim] failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1549628766161/work/aten/src/THC/generated/…/THCReduceAll.cuh line=317 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "train.py", line 90, in <module>
    main()
  File "train.py", line 86, in main
    logger=logger)
  File "/home/hs/anaconda3/lib/python3.6/site-packages/mmdet-0.6rc0+unknown-py3.6.egg/mmdet/apis/train.py", line 59, in train_detector
    _non_dist_train(model, dataset, cfg, validate=validate)
  File "/home/hs/anaconda3/lib/python3.6/site-packages/mmdet-0.6rc0+unknown-py3.6.egg/mmdet/apis/train.py", line 121, in _non_dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/hs/anaconda3/lib/python3.6/site-packages/mmcv/runner/runner.py", line 355, in run
    epoch_runner(data_loaders[i], *kwargs)
  File "/home/hs/anaconda3/lib/python3.6/site-packages/mmcv/runner/runner.py", line 268, in train
    self.call_hook('after_train_iter')
  File "/home/hs/anaconda3/lib/python3.6/site-packages/mmcv/runner/runner.py", line 228, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/hs/anaconda3/lib/python3.6/site-packages/mmcv/runner/hooks/optimizer.py", line 17, in after_train_iter
    runner.outputs['loss'].backward()
  File "/home/hs/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/hs/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True) # allow_unreachable flag
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1549628766161/work/aten/src/THC/generated/…/THCReduceAll.cuh:317
terminate called after throwing an instance of 'c10::Error'
  what(): CUDA error: device-side assert triggered (insert_events at /opt/conda/conda-bld/pytorch_1549628766161/work/aten/src/THC/THCCachingAllocator.cpp:470)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7fb752f6ccf5 in /home/hs/anaconda3/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x122a0d0 (0x7fb75723f0d0 in /home/hs/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #2: at::TensorImpl::release_resources() + 0x50 (0x7fb7536d8c30 in /home/hs/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #3: <unknown function> + 0x2a836b (0x7fb750cea36b in /home/hs/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #4: <unknown function> + 0x30eff0 (0x7fb750d50ff0 in /home/hs/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #5: torch::autograd::deleteFunction(torch::autograd::Function) + 0x2f0 (0x7fb750cecd70 in /home/hs/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #6: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x45 (0x7fb7741887f5 in /home/hs/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #7: torch::autograd::Variable::Impl::release_resources() + 0x4a (0x7fb750f5f1ba in /home/hs/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #8: <unknown function> + 0x12148b (0x7fb7741a048b in /home/hs/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x31a49f (0x7fb77439949f in /home/hs/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x31a4e1 (0x7fb7743994e1 in /home/hs/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #26: __libc_start_main + 0xf0 (0x7fb78ff0f830 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)
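
A note on reading the log above: the failing assertion, indexValue >= 0 && indexValue < tensor.sizes[dim], comes from a scatter kernel, which suggests that some index (often a class label) fell outside the size of the tensor it was written into. Whether that is what happened in this run is not confirmed anywhere in the thread. The sketch below only shows the kind of range check that could be run on the labels before they reach the GPU. The helper name find_out_of_range_labels and the value NUM_CLASSES = 21 (20 VOC classes plus one background slot) are assumptions for illustration and are not taken from the issue.

```python
def find_out_of_range_labels(labels, num_classes):
    """Return (position, value) pairs for labels outside [0, num_classes)."""
    return [(i, v) for i, v in enumerate(labels) if not 0 <= v < num_classes]


# Hypothetical values: 20 VOC object classes plus one background slot.
NUM_CLASSES = 21

# Hypothetical labels; 21 and -1 are deliberately out of range.
sample_labels = [1, 7, 20, 21, -1]

bad = find_out_of_range_labels(sample_labels, NUM_CLASSES)
print(bad)  # [(3, 21), (4, -1)]
```

Because CUDA kernels run asynchronously, the Python traceback may point at a later call (here, backward()) rather than the operation that actually failed. Rerunning with the environment variable CUDA_LAUNCH_BLOCKING=1, or on the CPU, usually gives a more precise location.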

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 5

Top GitHub Comments

2 reactions
wizholy commented, Mar 22, 2019

I decreased the lr, but it still does not work.

1 reaction
yhcao6 commented, Mar 13, 2019

You should decrease the lr if you train the model on a single card.
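
One common way to choose the smaller lr is the linear scaling rule: scale the learning rate in proportion to the total batch size (GPUs times images per GPU). The sketch below is a minimal illustration of that rule, not something stated in the thread. The function name scale_lr is hypothetical, and the reference setup of 8 GPUs with 8 images each and the single-card setup of 4 images are assumed values. Only the lr of 0.001 comes from the log.

```python
def scale_lr(ref_lr, ref_batch_size, batch_size):
    """Linear scaling rule: lr grows or shrinks with the total batch size."""
    if ref_batch_size <= 0 or batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return ref_lr * batch_size / ref_batch_size


# lr reported in the training log.
ref_lr = 0.001

# Assumed reference setup: 8 GPUs x 8 images per GPU (hypothetical).
ref_batch_size = 8 * 8

# Assumed single-card setup: 1 GPU x 4 images per GPU (hypothetical).
batch_size = 1 * 4

scaled = scale_lr(ref_lr, ref_batch_size, batch_size)
print(f"suggested lr: {scaled:.2e}")
```

The resulting value would go into the lr field of the optimizer settings in the config. As the other comment reports, lowering the lr alone did not resolve the crash for everyone.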

Top Results From Across the Web

ssd_keras/ssd300_training.ipynb at master - GitHub
This tutorial explains how to train an SSD300 on the Pascal VOC datasets. The preset parameters reproduce the training of the original SSD300...

04. Train SSD on Pascal VOC dataset
Please first go through this Prepare PASCAL VOC datasets tutorial to setup Pascal VOC dataset on your disk. Then, we are ready to...

Object Detection using PyTorch and SSD300 with VGG16 ...
Use SSD300 deep learning object detector with PyTorch deep learning framework for object detection in images and videos.

Object detection using SSD300-RN34 on Pascal VOC. (a) The...
Fig. 10(b) shows the EPI curve for this experiment. As shown, the EPI value increases rapidly at the early stage of training and...

ssd300 — OpenVINO™ documentation — Version(latest)
The ssd300 model is the Caffe* framework implementation of Single-Shot multibox Detection ... Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train.
