pred_label and label shape mismatch error during evaluation with multi-node, multi-GPU training

I use "gpu_collect" because my multi-node machines have no shared storage. I run the code on multiple nodes with multiple GPUs. Training works fine, but an error occurs during evaluation.

mask: torch.Size([360, 530])
pred_label: torch.Size([256, 256])
Traceback (most recent call last):
  File "tools/train.py", line 166, in <module>
    main()
  File "tools/train.py", line 162, in main
    meta=meta)
  File "/opt/mmsegmentation/mmseg/apis/train.py", line 116, in train_segmentor
    runner.run(data_loaders, cfg.workflow)
  File "/home/.local/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 133, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/home/.local/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 66, in train
    self.call_hook('after_train_iter')
  File "/home/.local/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/.local/lib/python3.7/site-packages/mmcv/runner/hooks/evaluation.py", line 215, in after_train_iter
    self._do_evaluate(runner)
  File "/opt/mmsegmentation/mmseg/core/evaluation/eval_hooks.py", line 96, in _do_evaluate
    key_score = self.evaluate(runner, results)
  File "/home/.local/lib/python3.7/site-packages/mmcv/runner/hooks/evaluation.py", line 311, in evaluate
    results, logger=runner.logger, **self.eval_kwargs)
  File "/opt/mmsegmentation/mmseg/datasets/custom.py", line 344, in evaluate
    reduce_zero_label=self.reduce_zero_label)
  File "/opt/mmsegmentation/mmseg/core/evaluation/metrics.py", line 298, in eval_metrics
    reduce_zero_label)
  File "/opt/mmsegmentation/mmseg/core/evaluation/metrics.py", line 129, in total_intersect_and_union
    label_map, reduce_zero_label)
  File "/opt/mmsegmentation/mmseg/core/evaluation/metrics.py", line 79, in intersect_and_union
    pred_label = pred_label[mask]
IndexError: The shape of the mask [360, 530] at index 0 does not match the shape of the indexed tensor [256, 256] at index 0
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'tools/train.py', '--local_rank=7', 'configs/swin/upernet_swin_tiny_patch4_window7_512x512_160k_ade20k_pretrain_224x224_1K.py', '--launcher', 'pytorch']' returned non-zero exit status 1.

I suspect this error is caused by an order mismatch between pred_label and label (see PR 522). However, that PR does not seem to take effect.
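
To make the suspected order mismatch concrete, here is a minimal sketch in plain Python. It assumes the evaluation set is split across ranks in a strided, padded fashion (as a non-shuffling distributed sampler typically does) and uses shape tuples in place of tensors. All function names are hypothetical and this is not mmseg's code; it only illustrates how gathering per-rank results in the wrong order pairs a prediction with the label of a different image, and how a shape check could locate the first bad pair.

```python
from typing import List, Optional, Sequence, Tuple

Shape = Tuple[int, int]


def shard_indices(num_samples: int, world_size: int, rank: int) -> List[int]:
    """Indices one rank would evaluate under a strided, padded split (assumed)."""
    per_rank = -(-num_samples // world_size)  # ceiling division
    total = per_rank * world_size
    # Pad by wrapping around so every rank gets the same number of samples.
    padded = [i % num_samples for i in range(total)]
    return padded[rank:total:world_size]


def gather_interleaved(parts: Sequence[Sequence], num_samples: int) -> List:
    """Restore dataset order by interleaving per-rank results, then drop padding."""
    ordered = []
    for group in zip(*parts):
        ordered.extend(group)
    return ordered[:num_samples]


def gather_concatenated(parts: Sequence[Sequence], num_samples: int) -> List:
    """Naive gather: rank 0's results, then rank 1's, and so on (wrong order)."""
    flat = [item for part in parts for item in part]
    return flat[:num_samples]


def first_shape_mismatch(pred_shapes: Sequence[Shape],
                         label_shapes: Sequence[Shape]) -> Optional[int]:
    """Index of the first prediction whose shape differs from its label, if any."""
    for i, (pred, label) in enumerate(zip(pred_shapes, label_shapes)):
        if pred != label:
            return i
    return None


# Toy dataset: five images with distinct sizes, so any reordering is visible.
label_shapes = [(360, 530), (256, 256), (480, 640), (300, 400), (512, 512)]
world_size = 2

# Each rank "predicts" a map with the same shape as the image it was given.
parts = [
    [label_shapes[i] for i in shard_indices(len(label_shapes), world_size, rank)]
    for rank in range(world_size)
]

good = gather_interleaved(parts, len(label_shapes))
bad = gather_concatenated(parts, len(label_shapes))

print("interleaved mismatch index:", first_shape_mismatch(good, label_shapes))
print("concatenated mismatch index:", first_shape_mismatch(bad, label_shapes))
```

Running a check like first_shape_mismatch on the real prediction and label shapes just before the metrics are computed would show whether the gathered results are out of order, and at which sample the pairing first breaks.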