Using ignite.distributed with 3 or more processes hangs indefinitely
❓ Questions/Help/Support
I'm trying to use ignite.distributed to train a model with DDP. The issue I encounter is that when spawning 3 or more processes to run my code, it seems to hang indefinitely. It works fine with 2 processes. I even tried a very basic script (similar to the tutorial) and it still hangs.
# run.py
import torch
import ignite.distributed as idist


def run(rank, config):
    print(f"Running basic DDP example on rank {rank}.")


def main():
    world_size = 4  # if this is 3 or more it hangs
    # some dummy config
    config = {}
    # run task
    idist.spawn("nccl", run, args=(config,), nproc_per_node=world_size)
    # the same happens even in this case
    # with idist.Parallel(backend="nccl", nproc_per_node=world_size) as parallel:
    #     parallel.run(run, config)


if __name__ == "__main__":
    main()
Executing this with:
python -m module.run
I’d be very grateful if anyone can weigh in on this.
Environment
- PyTorch Version: 1.9.0
- Ignite Version: 0.4.6
- OS: Ubuntu 20.04.2 LTS
- How you installed Ignite (conda, pip, source): conda
- Python version: 3.9.6
- Any other relevant information: Running on 4 A100-PCIE-40GB GPUs
Issue Analytics
- State:
- Created 2 years ago
- Comments:16 (3 by maintainers)
Thanks for the quick feedback again. We are working on a package (aitlas) that has multiple modules in it. I am running the script from the root of the package with
python -m package.test
The package/test.py script contains the code I shared earlier.
The server has 4 A100-PCIE-40GB GPUs, 2 TB of RAM, and 256 AMD EPYC 7742 64-Core Processors.
PyTorch (1.9.0) and PyTorch-Ignite (0.4.6) are installed with conda (4.9.2).
This appears in the logs when running:
And it stays in this state indefinitely.
Thanks for the feedback, @ivankitanovski! I'd expect gloo to be a bit slower than nccl. For completeness' sake, could you still please run the above commands so we can understand what happened with nccl?
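For anyone following up on the nccl side, here is a generic starting point. It is a sketch only, and not the commands referred to above, which this page does not reproduce. It turns on NCCL's own logging through the NCCL_DEBUG environment variable and can optionally disable direct GPU peer-to-peer transfers through NCCL_P2P_DISABLE. The helper name configure_nccl_diagnostics is hypothetical and not part of ignite or PyTorch. Whether peer-to-peer transfers are involved in this hang is an assumption and has not been established in this thread.

```python
import os


def configure_nccl_diagnostics(env=None, disable_p2p=False):
    """Set NCCL diagnostic variables in `env` (defaults to os.environ).

    Hypothetical helper for illustration; not part of ignite or PyTorch.
    """
    env = os.environ if env is None else env
    # NCCL_DEBUG=INFO makes NCCL print its initialisation and transport logs.
    env["NCCL_DEBUG"] = "INFO"
    if disable_p2p:
        # Assumption to test, not a confirmed cause: direct GPU peer-to-peer
        # transfers are what stalls with 3 or more processes.
        env["NCCL_P2P_DISABLE"] = "1"
    return env


# Demonstration on a plain dict, so the real environment is left untouched.
settings = configure_nccl_diagnostics(env={}, disable_p2p=True)
print(settings)
```

In the run.py script from the question, calling configure_nccl_diagnostics() at the top of main(), before idist.spawn(...), should be enough, because the spawned worker processes inherit the parent's environment. If the hang disappears with disable_p2p=True, that points at the peer-to-peer path rather than at ignite itself.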