Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might look while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

assert len(indices) == self.total_size error during multiple GPU training

See original GitHub issue

I am trying to train on my dataset with 8 GPUs. However, after calling ./dist_train.sh, this assertion error appears:

Traceback (most recent call last):
  File "./tools/train.py", line 113, in <module>
    main()
  File "./tools/train.py", line 109, in main
    logger=logger)
  File "/mmdetection/mmdet/apis/train.py", line 58, in train_detector
    _dist_train(model, dataset, cfg, validate=validate)
  File "/mmdetection/mmdet/apis/train.py", line 186, in _dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/opt/conda/lib/python3.6/site-packages/mmcv/runner/runner.py", line 358, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/mmcv/runner/runner.py", line 260, in train
    for i, data_batch in enumerate(data_loader):
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 193, in __iter__
    return _DataLoaderIter(self)
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 493, in __init__
    self._put_indices()
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 591, in _put_indices
    indices = next(self.sample_iter, None)
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/sampler.py", line 172, in __iter__
    for idx in self.sampler:
  File "/mmdetection/mmdet/datasets/loader/sampler.py", line 138, in __iter__
    assert len(indices) == self.total_size

In the config I tried various values for imgs_per_gpu and workers_per_gpu; currently they are imgs_per_gpu=2 and workers_per_gpu=2, but no combination worked. Single-GPU training works fine.
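For reference, those two fields sit in the config's data block. A minimal sketch in mmdetection 1.x style (the dataset type and paths below are placeholders, not the reporter's actual config):

    # sketch of the relevant data settings; values are placeholders
    data = dict(
        imgs_per_gpu=2,      # images sampled per GPU each iteration
        workers_per_gpu=2,   # DataLoader worker processes per GPU
        train=dict(
            type='CocoDataset',                      # placeholder
            ann_file='data/annotations/train.json',  # placeholder
            img_prefix='data/train/',                # placeholder
        ),
    )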

What is the meaning of this assert? Thanks!
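For context: the assert is the final check in DistributedGroupSampler.__iter__ (mmdet/datasets/loader/sampler.py, line 138 in the traceback). mmdetection buckets images into two groups by aspect ratio (dataset.flag is 0 or 1 per image) and pads each group so the samples divide evenly across GPUs; the assert verifies that the padding arithmetic came out exact. The sketch below is a simplified, illustrative paraphrase of that logic from mmdetection 1.x (shuffling omitted, rounding condensed; details may differ in your exact version), including the padding step that can under-fill when one group is very small:

    import math
    import numpy as np

    class DistributedGroupSamplerSketch:
        """Illustrative sketch of mmdetection's DistributedGroupSampler."""

        def __init__(self, dataset, samples_per_gpu, num_replicas, rank):
            self.flag = np.asarray(dataset.flag)  # one group id per image
            self.samples_per_gpu = samples_per_gpu
            self.num_replicas = num_replicas
            self.rank = rank
            self.group_sizes = np.bincount(self.flag)
            # each group is rounded up to a multiple of
            # samples_per_gpu * num_replicas
            self.num_samples = sum(
                int(math.ceil(size / (samples_per_gpu * num_replicas)))
                * samples_per_gpu
                for size in self.group_sizes)
            self.total_size = self.num_samples * self.num_replicas

        def __iter__(self):
            indices = []
            for i, size in enumerate(self.group_sizes):
                member = list(np.where(self.flag == i)[0])
                extra = (int(math.ceil(size / (self.samples_per_gpu
                                               * self.num_replicas)))
                         * self.samples_per_gpu * self.num_replicas
                         - int(size))
                # Padding by a single slice, as older versions did: if a
                # group holds fewer images than `extra`, the slice pads
                # too few items and the assert below fires.
                indices += member + member[:extra]
            assert len(indices) == self.total_size  # the failing line
            # each rank takes its contiguous share of the padded list
            return iter(indices[self.rank * self.num_samples:
                                (self.rank + 1) * self.num_samples])

    # Demo: 100 landscape + 3 portrait images on 8 GPUs, 2 imgs per GPU.
    # The portrait group needs 16 padded samples but can supply only 6,
    # so iterating raises AssertionError.
    class Dummy:
        flag = [0] * 100 + [1] * 3

    sampler = DistributedGroupSamplerSketch(
        Dummy(), samples_per_gpu=2, num_replicas=8, rank=0)
    list(sampler)  # AssertionError

If this reading applies to your version, the assert means one aspect-ratio group is too small to be padded across imgs_per_gpu × num_GPUs, which would explain why it fails only on 8 GPUs while single-GPU training runs fine. Counting the group sizes (e.g. np.bincount(dataset.flag), if your dataset exposes flag) would confirm it.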

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 13 (3 by maintainers)

Top GitHub Comments

1 reaction
ZhexuanZhou commented, Aug 23, 2019

I am hitting the same issue. How can it be fixed?

0 reactions
MyLtYkRiTiK commented, Sep 13, 2019

Then I deleted the images with w > h and got another error: TypeError: 'NoneType' object is not subscriptable.
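That TypeError is usually a sign that some step received None where it expected an array, most often an image file that OpenCV fails to read. A generic sanity check, independent of mmdetection (img_dir is a placeholder path):

    import os
    import cv2

    img_dir = 'data/train'  # placeholder: point this at your image folder
    unreadable = []
    for name in sorted(os.listdir(img_dir)):
        # cv2.imread returns None for missing or corrupt files, which
        # later surfaces as "'NoneType' object is not subscriptable"
        if cv2.imread(os.path.join(img_dir, name)) is None:
            unreadable.append(name)
    print(len(unreadable), 'unreadable files:', unreadable[:10])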

Read more comments on GitHub >

Top Results From Across the Web

Multi-GPU Training - RuntimeError: one of the variables ...
I am using torch 1.8.0+cu11 on NVIDIA A6000 GPUs. ... W assert L == H * W, "input feature has wrong size" shortcut...
Read more >
CUDA C++ Programming Guide - NVIDIA Documentation Center
The index of a thread and its thread ID relate to each other in a ... The CUDA memory consistency model guarantees that...
Read more >
Source code for transformers.trainer - Hugging Face
TrainingArguments`, `optional`): The arguments to tweak for training. ... self.optimizer, opt_level=self.args.fp16_opt_level) # multi-gpu training (should ...
Read more >
Pytorch: IndexError: index out of range in self. How to solve?
Any input less than zero or more than declared input dimension raise this error. Compare your input and the dimension mentioned in torch.nn....
Read more >
gluoncv.auto.data.dataset — AutoGluon Documentation 0.4.2 ...
LABEL_COL]) for idx in indices if idx < len(df)] _show_images(images, cols=ncol, titles=titles, fontsize=fontsize) def to_mxnet(self): """Return a mxnet ...
Read more >
