pred_label and label mismatch errors occur when using multi-node multi-gpus
I use "gpu_collect" because my multi-node machines have no shared storage. I run the code on multiple nodes with multiple GPUs. Training works fine, but an error occurs during evaluation.
mask: torch.Size([360, 530])
pred_label: torch.Size([256, 256])
Traceback (most recent call last):
File "tools/train.py", line 166, in <module>
main()
File "tools/train.py", line 162, in main
meta=meta)
File "/opt/mmsegmentation/mmseg/apis/train.py", line 116, in train_segmentor
runner.run(data_loaders, cfg.workflow)
File "/home/.local/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 133, in run
iter_runner(iter_loaders[i], **kwargs)
File "/home/.local/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 66, in train
self.call_hook('after_train_iter')
File "/home/.local/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
getattr(hook, fn_name)(self)
File "/home/.local/lib/python3.7/site-packages/mmcv/runner/hooks/evaluation.py", line 215, in after_train_iter
self._do_evaluate(runner)
File "/opt/mmsegmentation/mmseg/core/evaluation/eval_hooks.py", line 96, in _do_evaluate
key_score = self.evaluate(runner, results)
File "/home/.local/lib/python3.7/site-packages/mmcv/runner/hooks/evaluation.py", line 311, in evaluate
results, logger=runner.logger, **self.eval_kwargs)
File "/opt/mmsegmentation/mmseg/datasets/custom.py", line 344, in evaluate
reduce_zero_label=self.reduce_zero_label)
File "/opt/mmsegmentation/mmseg/core/evaluation/metrics.py", line 298, in eval_metrics
reduce_zero_label)
File "/opt/mmsegmentation/mmseg/core/evaluation/metrics.py", line 129, in total_intersect_and_union
label_map, reduce_zero_label)
File "/opt/mmsegmentation/mmseg/core/evaluation/metrics.py", line 79, in intersect_and_union
pred_label = pred_label[mask]
IndexError: The shape of the mask [360, 530] at index 0 does not match the shape of the indexed tensor [256, 256] at index 0
Traceback (most recent call last):
File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 260, in <module>
main()
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 256, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'tools/train.py', '--local_rank=7', 'configs/swin/upernet_swin_tiny_patch4_window7_512x512_160k_ade20k_pretrain_224x224_1K.py', '--launcher', 'pytorch']' returned non-zero exit status 1.
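I guess this error is caused by an order mismatch between pred_label and label (see PR 522): each prediction ends up compared against the label of a different image. However, that PR does not seem to fix the problem in my case.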
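To make the suspected failure mode concrete, here is a minimal standalone sketch. It assumes the evaluation set is split across ranks round-robin with padding (the way a non-shuffling distributed sampler typically behaves), and that each rank returns its results in local order. The function names `shard_indices` and `merge_in_dataset_order` are hypothetical and are not mmseg or mmcv APIs. Integers stand in for per-image predictions so the ordering is easy to see.

```python
def shard_indices(num_samples, world_size, rank):
    """Sample indices one rank would evaluate (assumption: padded round-robin split)."""
    indices = list(range(num_samples))
    # Round the total up to a multiple of world_size, then pad by repeating the head.
    total = -(-num_samples // world_size) * world_size
    indices += indices[: total - num_samples]
    return indices[rank:total:world_size]


def merge_in_dataset_order(parts, num_samples):
    """Interleave per-rank result lists back into dataset order and drop the padding."""
    ordered = []
    for group in zip(*parts):
        ordered.extend(group)
    return ordered[:num_samples]


num_samples, world_size = 10, 4
# parts[r] holds the "results" produced by rank r.
parts = [shard_indices(num_samples, world_size, r) for r in range(world_size)]

# Naive merge: concatenate rank after rank. The order no longer matches the dataset.
naive = [i for part in parts for i in part][:num_samples]

# Interleaved merge: result i lines up with label i again.
merged = merge_in_dataset_order(parts, num_samples)

print("naive :", naive)
print("merged:", merged)
```

If the gathered results reach the evaluation code in something like the naive order, result i is scored against the label of a different image. When image sizes vary, that would surface as a shape error like the `[256, 256]` versus `[360, 530]` mismatch in the traceback above. Whether this is what happens here depends on how the installed mmcv and mmseg versions gather results across nodes.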
Issue Analytics
- State:
- Created 2 years ago
- Comments:15 (9 by maintainers)
It works! Great!
Thanks for your hard work!
@PeizeSun @Junjun2016 Hello, I ran into this problem as well. Which parts of the code should I modify? Thanks!