Distributed training compatibility issue in ignite 0.4.2
❓ Questions/Help/Support
Hi @vfdev-5 ,
I am trying to upgrade ignite to v0.4.2 in MONAI, and I get an error when running this MONAI test program: https://github.com/Project-MONAI/MONAI/blob/master/tests/test_handler_rocauc_dist.py. I used 2 GPUs on 1 node, and the same test passed with ignite v0.3.0. Here is the error log:
root@apt-sh-ai:/workspace/data/medical/MONAI# python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 --node_rank=0 --master_addr="10.23.137.29" --master_port=1234 tests/test_handler_rocauc_dist.py
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
(both worker processes print the same traceback; shown once below)
Traceback (most recent call last):
  File "tests/test_handler_rocauc_dist.py", line 48, in <module>
    main()
  File "tests/test_handler_rocauc_dist.py", line 23, in main
    auc_metric = ROCAUC(to_onehot_y=True, softmax=True)
  File "/workspace/data/medical/MONAI/monai/handlers/roc_auc.py", line 66, in __init__
    super().__init__(output_transform, device=device)
  File "/opt/conda/lib/python3.6/site-packages/ignite/metrics/metric.py", line 200, in __init__
    if idist.get_world_size() > 1:
  File "/opt/conda/lib/python3.6/site-packages/ignite/distributed/utils.py", line 133, in get_world_size
    sync(temporary=True)
  File "/opt/conda/lib/python3.6/site-packages/ignite/distributed/utils.py", line 64, in sync
    model = comp_model_cls.create_from_context()
  File "/opt/conda/lib/python3.6/site-packages/ignite/distributed/comp_models/native.py", line 48, in create_from_context
    return _NativeDistModel()
  File "/opt/conda/lib/python3.6/site-packages/ignite/distributed/comp_models/native.py", line 64, in __init__
    self._init_from_context()
  File "/opt/conda/lib/python3.6/site-packages/ignite/distributed/comp_models/native.py", line 97, in _init_from_context
    self._setup_attrs()
  File "/opt/conda/lib/python3.6/site-packages/ignite/distributed/comp_models/base.py", line 26, in _setup_attrs
    self._nproc_per_node = self._compute_nproc_per_node() if self.get_world_size() > 1 else 1
  File "/opt/conda/lib/python3.6/site-packages/ignite/distributed/comp_models/native.py", line 101, in _compute_nproc_per_node
    dist.all_reduce(tensor, op=dist.ReduceOp.MAX)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 938, in all_reduce
    work = _default_pg.allreduce([tensor], opts)
RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:558, invalid usage, NCCL version 2.7.8
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 257, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', 'tests/test_handler_rocauc_dist.py', '--local_rank=1']' returned non-zero exit status 1.
Is something wrong with my NCCL version, or with ignite v0.4.2?
Thanks.
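For context, the failing call chain above bottoms out in a dist.all_reduce inside ignite's _compute_nproc_per_node. A minimal sketch that reproduces the same failure mode outside MONAI (the file name minimal_repro.py is hypothetical; launch it with the same torch.distributed.launch command as above), assuming neither rank has pinned its CUDA device:

# minimal_repro.py (hypothetical name); launch with e.g.:
#   python -m torch.distributed.launch --nproc_per_node=2 minimal_repro.py
import torch
import torch.distributed as dist

def main():
    # torch.distributed.launch sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE,
    # so env:// initialization works without extra arguments.
    dist.init_process_group(backend="nccl", init_method="env://")

    # Without a torch.cuda.set_device(local_rank) call, the default CUDA
    # device is cuda:0 for every rank on this node.
    tensor = torch.tensor([dist.get_rank() + 1], device="cuda")

    # Two ranks sharing one GPU is invalid for NCCL, so this collective
    # raises the same "invalid usage" RuntimeError as in the log above.
    dist.all_reduce(tensor, op=dist.ReduceOp.MAX)

if __name__ == "__main__":
    main()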
Top GitHub Comments
Hi @vfdev-5 , we used this Docker image: https://ngc.nvidia.com/catalog/containers/nvidia:pytorch. I think it is PyTorch 1.7a. Thanks.
Hi @vfdev-5 ,
Thanks for your quick help: after adding torch.cuda.set_device, the issue is solved. Maybe you could emit an explicit warning when this call is missing, since 0.4.2 implicitly relies on it. Thanks.
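For reference, a minimal sketch of the fix described above, assuming the torch.distributed.launch convention of passing a --local_rank argument to each worker: the key point is to bind each process to its own GPU before anything (here, ignite's metric constructor) triggers an NCCL collective.

import argparse

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
# torch.distributed.launch injects --local_rank=<n> into each worker process
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

# The fix confirmed in this thread: pin this process to its own GPU
# *before* any NCCL collective runs; ignite 0.4.2 implicitly relies on
# this when it calls dist.all_reduce while setting up metrics.
torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl", init_method="env://")

# ... construct ROCAUC(...) and run the rest of the test as before ...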