Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might look while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

assert len(indices) == self.total_size error during multiple GPU training

See original GitHub issue

I am trying to train on my dataset with 8 GPUs. However, after calling ./dist_train.sh, this assertion error appears:

Traceback (most recent call last):
  File "./tools/train.py", line 113, in <module>
    main()
  File "./tools/train.py", line 109, in main
    logger=logger)
  File "/mmdetection/mmdet/apis/train.py", line 58, in train_detector
    _dist_train(model, dataset, cfg, validate=validate)
  File "/mmdetection/mmdet/apis/train.py", line 186, in _dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/opt/conda/lib/python3.6/site-packages/mmcv/runner/runner.py", line 358, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/mmcv/runner/runner.py", line 260, in train
    for i, data_batch in enumerate(data_loader):
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 193, in __iter__
    return _DataLoaderIter(self)
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 493, in __init__
    self._put_indices()
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 591, in _put_indices
    indices = next(self.sample_iter, None)
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/sampler.py", line 172, in __iter__
    for idx in self.sampler:
  File "/mmdetection/mmdet/datasets/loader/sampler.py", line 138, in __iter__
    assert len(indices) == self.total_size

In the config I tried various values for imgs_per_gpu and workers_per_gpu; currently they are imgs_per_gpu=2 and workers_per_gpu=2, but no combination worked. Single-GPU training works fine.
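For reference, those two fields sit in the config's data block. A minimal sketch in mmdetection 1.x style (the dataset type and paths below are placeholders, not the reporter's actual config):

    # sketch of the relevant data settings; values are placeholders
    data = dict(
        imgs_per_gpu=2,      # images sampled per GPU each iteration
        workers_per_gpu=2,   # DataLoader worker processes per GPU
        train=dict(
            type='CocoDataset',                      # placeholder
            ann_file='data/annotations/train.json',  # placeholder
            img_prefix='data/train/',                # placeholder
        ),
    )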

What is the meaning of this assert? Thanks!
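For context: the assert is the final check in DistributedGroupSampler.__iter__ (mmdet/datasets/loader/sampler.py, line 138 in the traceback). mmdetection buckets images into two groups by aspect ratio (dataset.flag is 0 or 1 per image) and pads each group so the samples divide evenly across GPUs; the assert verifies that the padding arithmetic came out exact. The sketch below is a simplified, illustrative paraphrase of that logic from mmdetection 1.x (shuffling omitted, rounding condensed; details may differ in your exact version), including the padding step that can under-fill when one group is very small:

    import math
    import numpy as np

    class DistributedGroupSamplerSketch:
        """Illustrative sketch of mmdetection's DistributedGroupSampler."""

        def __init__(self, dataset, samples_per_gpu, num_replicas, rank):
            self.flag = np.asarray(dataset.flag)  # one group id per image
            self.samples_per_gpu = samples_per_gpu
            self.num_replicas = num_replicas
            self.rank = rank
            self.group_sizes = np.bincount(self.flag)
            # each group is rounded up to a multiple of
            # samples_per_gpu * num_replicas
            self.num_samples = sum(
                int(math.ceil(size / (samples_per_gpu * num_replicas)))
                * samples_per_gpu
                for size in self.group_sizes)
            self.total_size = self.num_samples * self.num_replicas

        def __iter__(self):
            indices = []
            for i, size in enumerate(self.group_sizes):
                member = list(np.where(self.flag == i)[0])
                extra = (int(math.ceil(size / (self.samples_per_gpu
                                               * self.num_replicas)))
                         * self.samples_per_gpu * self.num_replicas
                         - int(size))
                # Padding by a single slice, as older versions did: if a
                # group holds fewer images than `extra`, the slice pads
                # too few items and the assert below fires.
                indices += member + member[:extra]
            assert len(indices) == self.total_size  # the failing line
            # each rank takes its contiguous share of the padded list
            return iter(indices[self.rank * self.num_samples:
                                (self.rank + 1) * self.num_samples])

    # Demo: 100 landscape + 3 portrait images on 8 GPUs, 2 imgs per GPU.
    # The portrait group needs 16 padded samples but can supply only 6,
    # so iterating raises AssertionError.
    class Dummy:
        flag = [0] * 100 + [1] * 3

    sampler = DistributedGroupSamplerSketch(
        Dummy(), samples_per_gpu=2, num_replicas=8, rank=0)
    list(sampler)  # AssertionError

If this reading applies to your version, the assert means one aspect-ratio group is too small to be padded across imgs_per_gpu × num_GPUs, which would explain why it fails only on 8 GPUs while single-GPU training runs fine. Counting the group sizes (e.g. np.bincount(dataset.flag), if your dataset exposes flag) would confirm it.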

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 13 (3 by maintainers)

Top GitHub Comments

1 reaction
ZhexuanZhou commented, Aug 23, 2019

I am hitting the same issue. How can it be fixed?

0 reactions
MyLtYkRiTiK commented, Sep 13, 2019

Then I deleted the images with w > h and got another error: TypeError: 'NoneType' object is not subscriptable.
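That TypeError is usually a sign that some step received None where it expected an array, most often an image file that OpenCV fails to read. A generic sanity check, independent of mmdetection (img_dir is a placeholder path):

    import os
    import cv2

    img_dir = 'data/train'  # placeholder: point this at your image folder
    unreadable = []
    for name in sorted(os.listdir(img_dir)):
        # cv2.imread returns None for missing or corrupt files, which
        # later surfaces as "'NoneType' object is not subscriptable"
        if cv2.imread(os.path.join(img_dir, name)) is None:
            unreadable.append(name)
    print(len(unreadable), 'unreadable files:', unreadable[:10])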

Read more comments on GitHub >

Top Results From Across the Web

Multi-GPU Training - RuntimeError: one of the variables ...
I am using torch 1.8.0+cu11 on NVIDIA A6000 GPUs. ... W assert L == H * W, "input feature has wrong size" shortcut...
Read more >
CUDA C++ Programming Guide - NVIDIA Documentation Center
The index of a thread and its thread ID relate to each other in a ... The CUDA memory consistency model guarantees that...
Read more >
Source code for transformers.trainer - Hugging Face
TrainingArguments`, `optional`): The arguments to tweak for training. ... self.optimizer, opt_level=self.args.fp16_opt_level) # multi-gpu training (should ...
Read more >
Pytorch: IndexError: index out of range in self. How to solve?
Any input less than zero or more than declared input dimension raise this error. Compare your input and the dimension mentioned in torch.nn....
Read more >
gluoncv.auto.data.dataset — AutoGluon Documentation 0.4.2 ...
LABEL_COL]) for idx in indices if idx < len(df)] _show_images(images, cols=ncol, titles=titles, fontsize=fontsize) def to_mxnet(self): """Return a mxnet ...
Read more >
