
Using ignite.distributed with 3 or more processes hangs indefinitely


❓ Questions/Help/Support

Trying to use ignite.distributed to train a model with DDP. The issue I encounter is that when spawning 3 or more processes to run my code, it seems to hang indefinitely. It works fine with 2 processes. I even tried a very basic script (similar to the tutorial) and it still hangs.

# run.py
import torch
import ignite.distributed as idist

def run(rank, config):
    print(f"Running basic DDP example on rank {rank}.")

def main():
    world_size = 4  # if this is 3 or more it hangs

    # some dummy config
    config = {}

    # run task
    idist.spawn("nccl", run, args=(config,), nproc_per_node=world_size)
    
    # the same happens even in this case
    # with idist.Parallel(backend="nccl", nproc_per_node=world_size) as parallel:
    #     parallel.run(run, config)

if __name__ == "__main__":
    main()

Executing this with:

python -m module.run

I’d be very grateful if anyone can weigh in on this.
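As a sanity check (a minimal sketch, not taken from the issue itself), the same hang can be probed without Ignite by running a bare torch.distributed NCCL all_reduce across the 4 GPUs; the master address and port below are arbitrary assumptions:

# nccl_check.py -- hypothetical standalone script, not part of the original report
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"   # assumed single-node setup
    os.environ["MASTER_PORT"] = "29500"       # any free local port
    # Pin the process to its GPU before creating the process group,
    # so NCCL never has to guess the rank-to-device mapping.
    torch.cuda.set_device(rank)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    t = torch.ones(1, device=f"cuda:{rank}")
    dist.all_reduce(t)  # every rank should end up with world_size
    print(f"rank {rank}: all_reduce -> {t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 4  # the same count that hangs with ignite.distributed
    mp.spawn(worker, args=(world_size,), nprocs=world_size)

If this raw script also hangs at init_process_group with 3 or 4 processes, the problem is below Ignite, in NCCL or the GPU topology; if it completes, the hang is specific to Ignite's spawn/barrier path.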

Environment

  • PyTorch Version: 1.9.0
  • Ignite Version: 0.4.6
  • OS: Ubuntu 20.04.2 LTS
  • How you installed Ignite (conda, pip, source): conda
  • Python version: 3.9.6
  • Any other relevant information: Running on 4 A100-PCIE-40GB GPUs

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 16 (3 by maintainers)

Top GitHub Comments

2 reactions
ivankitanovski commented, Sep 7, 2021

Thanks for the quick feedback again. We are working on a package (aitlas) that has multiple modules in it. I am running the script from the root of the package with python -m package.test. The package/test.py script has the code I shared earlier.

The server has 4 A100-PCIE-40GB GPUs. It has 2 TB of RAM and 256 AMD EPYC 7742 64-Core Processors.

PyTorch (1.9.0) and PyTorch-Ignite (0.4.6) are installed with conda (4.9.2).

This appears in the logs when running:

(aitlas) user@kt-gpu2:~/aitlas$ python -m aitlas.test
2021-09-07 18:43:57,015 ignite.distributed.launcher.Parallel INFO: Initialized distributed launcher with backend: 'nccl'
2021-09-07 18:43:57,015 ignite.distributed.launcher.Parallel INFO: - Parameters to spawn processes:
        nproc_per_node: 4
        nnodes: 1
        node_rank: 0
2021-09-07 18:43:57,015 ignite.distributed.launcher.Parallel INFO: Spawn function '<function run at 0x7f51227a2e50>' in 4 processes
[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 2 using best-guess GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 1 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 3 using best-guess GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.

And it stays in this state indefinitely.
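The warnings above say the rank-to-GPU mapping is unknown when the first barrier runs. A minimal sketch of making that mapping explicit inside the spawned function (an assumed mitigation, not a fix confirmed in the thread) would be:

import torch
import ignite.distributed as idist

def run(rank, config):
    # idist.device() resolves to cuda:<local_rank> under the nccl backend;
    # setting it explicitly tells NCCL which GPU this rank owns.
    device = idist.device()
    torch.cuda.set_device(device)
    print(f"Running basic DDP example on rank {rank}, device {device}.")

Whether this alone removes the ProcessGroupNCCL warning (which is emitted during launcher initialization, possibly before run is called) or unblocks the 3+ process case is not confirmed in the thread.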

1 reaction
vfdev-5 commented, Sep 13, 2021

Thanks for the feedback, @ivankitanovski! I’d expect gloo to be a bit slower than nccl. For completeness’ sake, could you still please run the above commands so we can understand what happened with nccl?
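For reference, switching the reproduction script to the gloo backend for comparison only requires changing the backend string; a minimal sketch based on the run.py posted above:

import ignite.distributed as idist

def run(rank, config):
    print(f"Running basic DDP example on rank {rank}.")

def main():
    world_size = 4
    config = {}
    # gloo is CPU/TCP based and usually slower than nccl for GPU training,
    # but it is a useful control when nccl hangs during initialization.
    idist.spawn("gloo", run, args=(config,), nproc_per_node=world_size)

if __name__ == "__main__":
    main()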
