
Make Pytorch-Lightning DDP work without SLURM

See original GitHub issue

🚀 Feature

Allow pytorch-lightning DDP mode to work everywhere ordinary pytorch DDP can work. Basically, if every node in a cluster defines the following environment variables, it should work:

  • MASTER_PORT: A free port on the machine that will host the process with rank 0.
  • MASTER_ADDR: IP address of the machine that will host the process with rank 0.
  • WORLD_SIZE: The total number of processes, so that the master knows how many workers to wait for.
  • RANK: Rank of each process, so that each one knows whether it is the master or a worker.

See pytorch documentation
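
For reference, when those four variables are exported on every node, vanilla pytorch can join the process group via the env:// init method. A minimal sketch (the nccl backend and the print line are illustrative choices, not something the issue prescribes):

    import torch.distributed as dist

    # With init_method='env://', torch.distributed reads MASTER_ADDR,
    # MASTER_PORT, WORLD_SIZE and RANK from the environment.
    # 'nccl' assumes GPUs are available; 'gloo' also works on CPU-only nodes.
    dist.init_process_group(backend='nccl', init_method='env://')
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} joined the group")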

Motivation

Pytorch-lightning positions itself as a framework wrapper around pytorch. One of its differentiating features is the ease of distributed training, and it is very counterintuitive that it doesn’t work in cases where vanilla pytorch does.

For example, in Kubeflow there is a special operator, PyTorchJob, that spawns worker nodes with the proper environment variables so that torch.distributed.init_process_group can establish communication between the processes.

Pitch

While the user is able to override LightningModule.init_ddp_connection with the following:

    def init_ddp_connection(self, proc_rank: int, world_size: int) -> None:
        torch.distributed.init_process_group(
            'nccl', rank=proc_rank, world_size=world_size)

there’s at least one more place that is tightly coupled to SLURM and prevents running inside an ordinary pytorch distributed environment: the TrainerDDPMixin.ddp_train method:

    def ddp_train(self, gpu_idx, model):
        """
        Entry point into a DP thread
        :param gpu_idx:
        :param model:
        :param cluster_obj:
        :return:
        """
        # node rank using relative slurm id
        # otherwise default to node rank 0
        try:
            node_id = os.environ['SLURM_NODEID']
            self.node_rank = int(node_id)
        except Exception:
            self.node_rank = 0

One possible solution is to add a check for os.environ['RANK'] instead of just assigning rank 0 to the node when the SLURM variable is missing.
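
For illustration only, that fallback could look roughly like the sketch below; the helper name resolve_node_rank and the exact variable precedence are assumptions, not actual pytorch-lightning code:

    import os

    def resolve_node_rank() -> int:
        # Hypothetical helper: prefer SLURM's node id, otherwise fall back to
        # a rank variable set by a generic launcher, else default to node 0.
        for var in ('SLURM_NODEID', 'RANK'):
            if var in os.environ:
                return int(os.environ[var])
        return 0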

Alternatives

Additional context

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 6
  • Comments: 9 (2 by maintainers)

Top GitHub Comments

3 reactions
csvance commented, Mar 23, 2021

Multi-node DDP still works without SLURM in 1.1.6, but doesn’t seem to work in 1.2.4. It appears there was a major refactor of DDPPlugin between those versions.

One other thing: the documentation doesn’t mention that you need to set LOCAL_RANK per GPU as well. Say you are training on 2 nodes, each with 2 GPUs. At least in 1.1.6, Lightning won’t spawn a process per GPU; you need to set the local rank and start each process yourself.

On the first node:

    MASTER_ADDR=MasterNode MASTER_PORT=12345 WORLD_SIZE=4 NODE_RANK=0 LOCAL_RANK=0 python train.py
    MASTER_ADDR=MasterNode MASTER_PORT=12345 WORLD_SIZE=4 NODE_RANK=0 LOCAL_RANK=1 python train.py

On the second node:

    MASTER_ADDR=MasterNode MASTER_PORT=12345 WORLD_SIZE=4 NODE_RANK=1 LOCAL_RANK=0 python train.py
    MASTER_ADDR=MasterNode MASTER_PORT=12345 WORLD_SIZE=4 NODE_RANK=1 LOCAL_RANK=1 python train.py
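
If typing one command per GPU is tedious, a hypothetical launcher script (not part of Lightning) could spawn the per-GPU processes itself. This sketch assumes MASTER_ADDR, MASTER_PORT, WORLD_SIZE and NODE_RANK are already exported and that the entry point is train.py:

    import os
    import subprocess
    import sys

    import torch

    # Spawn one copy of train.py per visible GPU, giving each a distinct
    # LOCAL_RANK. The remaining DDP variables are inherited from the
    # parent environment.
    procs = []
    for local_rank in range(torch.cuda.device_count()):
        env = dict(os.environ, LOCAL_RANK=str(local_rank))
        procs.append(subprocess.Popen([sys.executable, 'train.py'], env=env))
    for p in procs:
        p.wait()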

3 reactions
faizanahemad commented, Dec 21, 2020

Any documentation on how to train multi-node without SLURM?


