Multiple GPU per node could fail silently with KubeflowEnvironment

See original GitHub issue

šŸ› Bug

If a user submits a DDP job to a Kubeflow environment with multiple GPUs per node, following the multi-GPU docs and passing the right args (num_nodes and devices), one of the following happens:

  • WORLD_SIZE and RANK are set to the total number of processes -> the job gets stuck, because creates_processes_externally=True prevents DDP from launching the other processes.
  • WORLD_SIZE and RANK are set to the total number of nodes -> the job starts with only local rank 0 of each node participating in distributed training. The major issue here, apart from the idle GPUs, is that DDPStrategy still computes ranks and world size as if all GPUs were participating and passes that full number of replicas to the distributed sampler:
...
        self.cluster_environment.set_global_rank(self.node_rank * self.num_processes + self.local_rank)
        self.cluster_environment.set_world_size(self.num_nodes * self.num_processes)

So the local rank 0 GPUs collectively get only 1/num_processes of the data, on the assumption that the other (idle) GPUs are processing the rest; training silently runs on just the subset of the dataset assigned to local rank 0 of each node. The user has no way of noticing this, since they did pass devices/gpus and num_nodes to the Trainer correctly.
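
To make the effect concrete, here is a small illustrative sketch (my own example, not code from the issue): it shards a toy dataset with a DistributedSampler for the intended world size of a hypothetical 2-node × 4-GPU cluster, and counts how much of the dataset the local-rank-0 processes actually consume.

    # Illustration only: the sampler shards for the full intended world size,
    # but in the buggy scenario only local rank 0 of each node consumes its shard.
    import torch
    from torch.utils.data import DistributedSampler, TensorDataset

    num_nodes, gpus_per_node = 2, 4          # hypothetical cluster shape
    world_size = num_nodes * gpus_per_node   # what DDPStrategy reports to the sampler
    dataset = TensorDataset(torch.arange(80))

    consumed = 0
    for node in range(num_nodes):
        # only global ranks 0 and 4 (local rank 0 of each node) actually run
        rank = node * gpus_per_node
        sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=False)
        consumed += len(list(sampler))

    print(f"samples seen per epoch: {consumed} / {len(dataset)}")  # 20 / 80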

To Reproduce

N/A (it’s how KubeflowEnvironment works)

Expected behavior

I’m not sure whether this is the expected behavior. I am using Google Vertex AI, which runs Kubeflow under the hood. When a PyTorch Lightning job is submitted to Vertex, PyTorch Lightning automatically selects KubeflowEnvironment as the cluster environment.
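
For context, a minimal sketch of the kind of Trainer configuration involved (illustrative values; nothing Kubeflow-specific is passed, since the cluster environment is auto-detected from the pod’s environment variables):

    import pytorch_lightning as pl

    # Hypothetical job: 2 replicas (nodes) with 4 GPUs each. No ClusterEnvironment
    # is passed explicitly; on a Kubeflow/Vertex PyTorchJob, Lightning picks
    # KubeflowEnvironment automatically.
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=4,
        num_nodes=2,
        strategy="ddp",
    )
    # trainer.fit(model)  # `model` being the user's LightningModule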

Please let me know if the expectation is to have a separate cluster environment class for something like Vertex AI. I’d be happy to create a PR to add the new env. But the reasons why I decided to report this as a bug are:

  1. KubeflowEnvironment has two very specific requirements: (a) nodes with a single GPU, and (b) manual creation of the processes. Neither of these requirements is related to or enforced by Kubeflow, and neither is mentioned in the docs, so the user wouldn’t know about them until they look at the code.
  2. The detect method of KubeflowEnvironment can be used for any Kubernetes env, and the rest of its methods basically implement a special case of LightningEnvironment where the user has to run the processes manually (a rough sketch of such a detection check is shown below).
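
For illustration, this is roughly the kind of check detect() performs, paraphrased from memory rather than copied from the Lightning source (the exact variable set may differ between versions); any Kubernetes pod with the standard torch.distributed variables would satisfy it:

    import os

    # Paraphrased sketch of a Kubeflow-style detection check; the helper name
    # is mine, not Lightning's, and the exact variable set is an assumption.
    def looks_like_kubeflow() -> bool:
        required = {"KUBERNETES_PORT", "MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK"}
        return required.issubset(os.environ)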

cc @awaelchli

Issue Analytics

  • State: open
  • Created: a year ago
  • Reactions: 2
  • Comments: 9 (6 by maintainers)

Top GitHub Comments

4 reactions
RamtinRassoli commented, Jun 24, 2022

The PyTorchJob operator sets WORLD_SIZE to the total number of replicas by default (here and here), which is different from what torch and Lightning expect. So KubeflowEnvironment should let DDPStrategy set global_rank/world_size, and create processes externally only if needed.

Updating the following methods would be enough to make KubeflowEnvironment a generic env that’s compatible with the Trainer args and multi-GPU clusters:

    # (assumes `import os` at module level and `self._global_rank = 0`
    # initialized in __init__)

    @property
    def creates_processes_externally(self) -> bool:
        # Processes are external only if a launcher already set LOCAL_RANK;
        # otherwise Lightning spawns one process per device itself.
        return "LOCAL_RANK" in os.environ

    def global_rank(self) -> int:
        return self._global_rank

    def set_global_rank(self, rank: int) -> None:
        # let DDPStrategy assign the global rank instead of trusting RANK
        self._global_rank = rank

    def local_rank(self) -> int:
        return int(os.environ.get("LOCAL_RANK", 0))

    def node_rank(self) -> int:
        # the PyTorchJob operator sets RANK per replica, i.e. per node
        return int(os.environ["RANK"])

That said, this would make it very similar to LightningEnvironment. Not sure if that’s a problem.
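
As a quick sanity check (my own worked example, with an illustrative 2-node × 4-GPU shape): combining the proposed methods with the DDPStrategy formulas quoted in the issue body, RANK supplies the node rank and LOCAL_RANK the per-node index, and every GPU ends up with a distinct global rank:

    # node_rank comes from the RANK env var, local_rank from LOCAL_RANK;
    # global rank / world size follow the DDPStrategy lines quoted above.
    num_nodes, num_processes = 2, 4  # hypothetical: 2 replicas x 4 GPUs
    world_size = num_nodes * num_processes

    for node_rank in range(num_nodes):
        for local_rank in range(num_processes):
            global_rank = node_rank * num_processes + local_rank
            print(f"node {node_rank} local {local_rank} -> global {global_rank}/{world_size}")
    # global ranks 0..7 are all covered, so every sampler shard is consumed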

1 reaction
RamtinRassoli commented, Jul 21, 2022

@neggert yes, LOCAL_RANK would be set for the subprocesses spun up by PL. And what you said about the PyTorchJob’s assumption makes sense; it’s just that, ideally, KubeflowEnvironment and LightningEnvironment should interpret num_nodes the same way.
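
To spell out what that means for the proposed creates_processes_externally check (a sketch of my reading, not authoritative): the once-per-node parent started by the PyTorchJob has no LOCAL_RANK, so Lightning spawns the per-GPU subprocesses itself and sets LOCAL_RANK for them; if an external launcher already started one process per GPU, LOCAL_RANK is present and nothing extra is spawned.

    import os

    # In the once-per-node parent process LOCAL_RANK is unset -> Lightning spawns
    # `devices` subprocesses (and sets LOCAL_RANK in each). If a launcher such as
    # torchrun already created one process per GPU, LOCAL_RANK is set -> external.
    creates_processes_externally = "LOCAL_RANK" in os.environ
    print("processes created externally:", creates_processes_externally)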

@awaelchli I’ll send a PR with the proposed changes soon. Thank you both!
