
Using ignite.distributed with 3 or more processes hangs indefinitely


❓ Questions/Help/Support

Trying to use ignite.distributed to train a model with DDP. The issue I encounter is that when spawning 3 or more processes to run my code, it seems to hang indefinitely. It works fine with 2 processes. I even tried a very basic script (similar to the tutorial) and it still hangs.

# run.py
import torch
import ignite.distributed as idist

def run(rank, config):
    print(f"Running basic DDP example on rank {rank}.")

def main():
    world_size = 4  # if this is 3 or more it hangs

    # some dummy config
    config = {}

    # run task
    idist.spawn("nccl", run, args=(config,), nproc_per_node=world_size)
    
    # the same happens even in this case
    # with idist.Parallel(backend="nccl", nproc_per_node=world_size) as parallel:
    #     parallel.run(run, config)

if __name__ == "__main__":
    main()

Executing this with:

python -m module.run

I’d be very grateful if anyone can weigh in on this.
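As a sanity check (a minimal sketch, not taken from the issue itself), the same hang can be probed without Ignite by running a bare torch.distributed NCCL all_reduce across the 4 GPUs; the master address and port below are arbitrary assumptions:

# nccl_check.py -- hypothetical standalone script, not part of the original report
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"   # assumed single-node setup
    os.environ["MASTER_PORT"] = "29500"       # any free local port
    # Pin the process to its GPU before creating the process group,
    # so NCCL never has to guess the rank-to-device mapping.
    torch.cuda.set_device(rank)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    t = torch.ones(1, device=f"cuda:{rank}")
    dist.all_reduce(t)  # every rank should end up with world_size
    print(f"rank {rank}: all_reduce -> {t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 4  # the same count that hangs with ignite.distributed
    mp.spawn(worker, args=(world_size,), nprocs=world_size)

If this raw script also hangs at init_process_group with 3 or 4 processes, the problem is below Ignite, in NCCL or the GPU topology; if it completes, the hang is specific to Ignite's spawn/barrier path.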

Environment

  • PyTorch Version: 1.9.0
  • Ignite Version: 0.4.6
  • OS: Ubuntu 20.04.2 LTS
  • How you installed Ignite (conda, pip, source): conda
  • Python version: 3.9.6
  • Any other relevant information: Running on 4 A100-PCIE-40GB GPUs

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 16 (3 by maintainers)

Top GitHub Comments

2 reactions
ivankitanovski commented, Sep 7, 2021

Thanks for the quick feedback again. We are working on a package (aitlas) that has multiple modules in it. I am running the script from the root of the package with python -m package.test. The package/test.py script has the code I shared earlier.

The server has 4 A100-PCIE-40GB GPUs. It has 2 TB of RAM and 256 AMD EPYC 7742 64-Core Processors.

PyTorch (1.9.0) and PyTorch-Ignite (0.4.6) are installed with conda (4.9.2).

This appears in the logs when running:

(aitlas) user@kt-gpu2:~/aitlas$ python -m aitlas.test
2021-09-07 18:43:57,015 ignite.distributed.launcher.Parallel INFO: Initialized distributed launcher with backend: 'nccl'
2021-09-07 18:43:57,015 ignite.distributed.launcher.Parallel INFO: - Parameters to spawn processes:
        nproc_per_node: 4
        nnodes: 1
        node_rank: 0
2021-09-07 18:43:57,015 ignite.distributed.launcher.Parallel INFO: Spawn function '<function run at 0x7f51227a2e50>' in 4 processes
[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 2 using best-guess GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 1 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 3 using best-guess GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.

And it stays in this state indefinitely.
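The warnings above say the rank-to-GPU mapping is unknown when the first barrier runs. A minimal sketch of making that mapping explicit inside the spawned function (an assumed mitigation, not a fix confirmed in the thread) would be:

import torch
import ignite.distributed as idist

def run(rank, config):
    # idist.device() resolves to cuda:<local_rank> under the nccl backend;
    # setting it explicitly tells NCCL which GPU this rank owns.
    device = idist.device()
    torch.cuda.set_device(device)
    print(f"Running basic DDP example on rank {rank}, device {device}.")

Whether this alone removes the ProcessGroupNCCL warning (which is emitted during launcher initialization, possibly before run is called) or unblocks the 3+ process case is not confirmed in the thread.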

1 reaction
vfdev-5 commented, Sep 13, 2021

Thanks for the feedback, @ivankitanovski! I’d expect gloo to be a bit slower than nccl. For completeness’ sake, could you still please run the above commands so we can understand what happened with nccl?
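For reference, switching the reproduction script to the gloo backend for comparison only requires changing the backend string; a minimal sketch based on the run.py posted above:

import ignite.distributed as idist

def run(rank, config):
    print(f"Running basic DDP example on rank {rank}.")

def main():
    world_size = 4
    config = {}
    # gloo is CPU/TCP based and usually slower than nccl for GPU training,
    # but it is a useful control when nccl hangs during initialization.
    idist.spawn("gloo", run, args=(config,), nproc_per_node=world_size)

if __name__ == "__main__":
    main()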
