
pred_label and label mismatch error during evaluation with multi-node multi-GPU training


I use "gpu_collect" because my multi-node machines have no shared storage. When I run the code on multiple nodes with multiple GPUs, training works fine, but evaluation fails with the following error:

mask: torch.Size([360, 530])
pred_label: torch.Size([256, 256])
Traceback (most recent call last):
  File "tools/train.py", line 166, in <module>
    main()
  File "tools/train.py", line 162, in main
    meta=meta)
  File "/opt/mmsegmentation/mmseg/apis/train.py", line 116, in train_segmentor
    runner.run(data_loaders, cfg.workflow)
  File "/home/.local/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 133, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/home/.local/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 66, in train
    self.call_hook('after_train_iter')
  File "/home/.local/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/.local/lib/python3.7/site-packages/mmcv/runner/hooks/evaluation.py", line 215, in after_train_iter
    self._do_evaluate(runner)
  File "/opt/mmsegmentation/mmseg/core/evaluation/eval_hooks.py", line 96, in _do_evaluate
    key_score = self.evaluate(runner, results)
  File "/home/.local/lib/python3.7/site-packages/mmcv/runner/hooks/evaluation.py", line 311, in evaluate
    results, logger=runner.logger, **self.eval_kwargs)
  File "/opt/mmsegmentation/mmseg/datasets/custom.py", line 344, in evaluate
    reduce_zero_label=self.reduce_zero_label)
  File "/opt/mmsegmentation/mmseg/core/evaluation/metrics.py", line 298, in eval_metrics
    reduce_zero_label)
  File "/opt/mmsegmentation/mmseg/core/evaluation/metrics.py", line 129, in total_intersect_and_union
    label_map, reduce_zero_label)
  File "/opt/mmsegmentation/mmseg/core/evaluation/metrics.py", line 79, in intersect_and_union
    pred_label = pred_label[mask]
IndexError: The shape of the mask [360, 530] at index 0 does not match the shape of the indexed tensor [256, 256] at index 0
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'tools/train.py', '--local_rank=7', 'configs/swin/upernet_swin_tiny_patch4_window7_512x512_160k_ade20k_pretrain_224x224_1K.py', '--launcher', 'pytorch']' returned non-zero exit status 1.
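
The IndexError in the first traceback is raised because a mask built from a 360x530 label is applied to a 256x256 prediction, so the two do not belong to the same image. A minimal sketch of a check that could be run before computing metrics, to find which sample is affected, is shown below. The helper name first_shape_mismatch and the example shapes are hypothetical and are not part of mmsegmentation.

```python
def first_shape_mismatch(pred_shapes, label_shapes):
    """Return the index of the first prediction/label pair whose shapes differ.

    Returns None when every pair has the same shape.
    """
    for idx, (pred, label) in enumerate(zip(pred_shapes, label_shapes)):
        if tuple(pred) != tuple(label):
            return idx
    return None


# Hypothetical shapes: sample 1 pairs a 256x256 prediction with a 360x530 label.
pred_shapes = [(512, 683), (256, 256), (360, 530)]
label_shapes = [(512, 683), (360, 530), (256, 256)]
print(first_shape_mismatch(pred_shapes, label_shapes))  # prints 1
```

With real data, the shape lists could be built with tuple(x.shape) for each prediction and each label, since torch.Size is a tuple subclass.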

I suspect this error is caused by pred_label and label being collected in different orders, as discussed in PR #522. However, that PR does not seem to take effect here.
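
To illustrate how such an order mismatch can arise in principle, here is a simplified simulation, not the actual mmcv or mmsegmentation code. It assumes a non-shuffled, round-robin sharding of the dataset across ranks (rank r gets samples r, r + world_size, ...), with padding so every rank receives the same number of samples. All function names are hypothetical. Whether this is the exact failure in this issue is not confirmed by the thread.

```python
def shard_indices(num_samples, world_size):
    """Split dataset indices round-robin across ranks, padding by wrap-around."""
    indices = list(range(num_samples))
    padding = (-num_samples) % world_size
    indices += indices[:padding]  # repeat the first samples so shards are equal
    return [indices[rank::world_size] for rank in range(world_size)]


def collect_concatenated(parts, num_samples):
    """Naive collection: append each rank's results one after another."""
    merged = []
    for part in parts:
        merged.extend(part)
    return merged[:num_samples]


def collect_interleaved(parts, num_samples):
    """Order-preserving collection: take one result from each rank in turn."""
    merged = []
    for group in zip(*parts):
        merged.extend(group)
    return merged[:num_samples]  # drop the padded duplicates


parts = shard_indices(num_samples=5, world_size=2)
print(parts)                           # [[0, 2, 4], [1, 3, 0]]
print(collect_concatenated(parts, 5))  # [0, 2, 4, 1, 3]
print(collect_interleaved(parts, 5))   # [0, 1, 2, 3, 4]
```

In this sketch only the interleaved collection matches the dataset order, so labels loaded in dataset order would be paired with the wrong predictions after a naive concatenation. If images differ in size, that pairing error shows up as a shape mismatch like the one in the traceback.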

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 15 (9 by maintainers)

Top GitHub Comments

2 reactions
PeizeSun commented, Aug 12, 2021

Hi @PeizeSun, please give #780 a try. Looking forward to your feedback.

It works! Great!

Thanks for your hard work!

0 reactions
PapaMadeleine2022 commented, Sep 30, 2022

@PeizeSun @Junjun2016 Hello, I have run into this problem too. Which parts of the code should I modify? Thanks~
