
Early experiments with Multi-GPU training and PyTorch Lightning


I spent a chunk of the weekend experimenting with multi-GPU training using PyTorch Lightning. I’m starting a discussion here to document the steps I took and to hopefully start brainstorming how we can bring multi-GPU training to DeepChem.

Here are a few initial thoughts on how to bring multi-GPU training to DeepChem:

  • The “simplest” strategy would be to start using PyTorch Lightning in TorchModel. The challenge here is that PyTorch Lightning involves a fair amount of “magic”: instead of nn.Module, you need to use pl.LightningModule. It’s also not yet clear how well other frameworks like DGL/PyG work with PyTorch Lightning, and we depend heavily on these frameworks for graph convolution primitives. PyTorch Lightning is also under rapid development and not entirely API stable; I found a lot of broken tutorials and forum discussions explaining that flags/arguments had changed. (A rough sketch of what the Lightning wrapping looks like follows this list.)
  • As one potential alternative, we could try using the lower-level distributed primitives from PyTorch directly (see https://pytorch.org/tutorials/beginner/dist_overview.html) for distributed training. I haven’t experimented with these myself yet, so I’m not sure how easy or hard this would be.
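For reference, here is a minimal, hedged sketch of what wrapping a plain nn.Module in a LightningModule looks like. The model class (PlainTorchNet) and the regression loss are hypothetical stand-ins rather than anything in DeepChem, and the commented Trainer flags follow recent Lightning documentation but, as noted above, have changed between Lightning versions.

```python
# Minimal sketch: wrapping an ordinary nn.Module in a LightningModule.
# PlainTorchNet and the MSE loss are hypothetical stand-ins for a DeepChem model.
import torch
import torch.nn as nn
import pytorch_lightning as pl


class PlainTorchNet(nn.Module):
    """An ordinary PyTorch model, e.g. what a TorchModel might hold."""

    def __init__(self, n_features: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.net(x)


class LitWrapper(pl.LightningModule):
    """LightningModule that delegates forward/optimization to the wrapped nn.Module."""

    def __init__(self, model: nn.Module):
        super().__init__()
        self.model = model

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.mse_loss(self(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


# Multi-GPU training then becomes a Trainer configuration (flags vary by Lightning version):
# trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp")
# trainer.fit(LitWrapper(PlainTorchNet()), train_dataloaders=loader)
```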

It’s not yet clear to me what the right path to multi-GPU is for DeepChem. We have many models using many backend frameworks. We’re shifting towards PyTorch as our mainstay, but even there we have PyG/DGL dependencies. The ideal implementation would be to upgrade TorchModel to support distributed training so that all DeepChem Torch models are distributed out of the box, but this will take some careful work. I think this will be a very powerful feature for us, since many new models for the sciences (AlphaFold-2, ChemBERTa, ProteinBERT, etc.) all depend on large-scale training.

Here are a few suggested questions for us to think about:

  1. Are there any serious blockers to adopting PyTorch Lightning beyond API instability? (For example, some serious incompatibility with DGL/PyG)
  2. How does PyTorch Lightning implement distributed training under the hood? Can we try to understand how Lightning does it and recreate similar infrastructure directly in TorchModel? (A sketch of the lower-level approach follows this list.)
  3. How serious an issue is GPU utilization? Will we need to make infrastructure upgrades to DiskDataset to really get the most out of multi-GPU?
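For question 2, my rough understanding is that Lightning’s "ddp" strategy is built on torch.distributed and DistributedDataParallel. The sketch below illustrates that lower-level approach directly. It is illustrative only, not a proposed TorchModel design, and it assumes the process-group environment variables (RANK, WORLD_SIZE, MASTER_ADDR, LOCAL_RANK) are set by a launcher such as torchrun.

```python
# Rough sketch of per-process training with torch.distributed + DDP.
# Assumes one process per GPU, launched externally (e.g. by torchrun).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def train_worker(model, dataset, epochs=1):
    # Join the process group; env:// reads RANK/WORLD_SIZE/MASTER_ADDR from the environment.
    dist.init_process_group(backend="nccl", init_method="env://")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    model = model.cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    # DistributedSampler shards the dataset so each process sees a distinct slice.
    sampler = torch.utils.data.distributed.DistributedSampler(dataset)
    loader = torch.utils.data.DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.Adam(ddp_model.parameters(), lr=1e-3)
    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            loss = torch.nn.functional.mse_loss(ddp_model(x), y)
            optimizer.zero_grad()
            loss.backward()  # gradients are all-reduced across ranks here
            optimizer.step()

    dist.destroy_process_group()
```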

CC @peastman @ncfrey @seyonechithrananda

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
rbharath commented, Jul 12, 2021

@ncfrey A tutorial along the lines you mention would be great! It could be a really nice way to scale DeepChem models right away without needing to add heavy duty new infrastructure. If you have bandwidth, a tutorial PR would be an awesome contribution 😃

1 reaction
ncfrey commented, Jul 12, 2021

I recently migrated completely to PL for distributed model training, so this is great!

I think that to fully utilize PL, you simply need your data available as a LightningDataModule, and you can inject your PyTorch model into a LightningModule. I really like this framework because the PyTorch dataloaders and models remain exactly the same: you simply wrap them in the corresponding PL classes.

Following that approach, to @peastman’s point, it isn’t even necessary to directly support PL as a dependency. There could be a tutorial that shows how to take any DeepChem PyTorch model and dataset, wrap them in the PL style, and do distributed training. I am doing this already so it would be easy to put together.
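To make the wrapping concrete, here is a hypothetical sketch of the pattern for the data side. The class name and the assumption that the DeepChem dataset exposes in-memory X/y arrays are illustrative only, not a documented DeepChem/Lightning bridge; the usage comment reuses the LitWrapper/PlainTorchNet names from the earlier sketch.

```python
# Hypothetical sketch: expose a DeepChem-style dataset through a LightningDataModule
# and train a wrapped model with a multi-GPU Trainer.
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset


class DeepChemDataModule(pl.LightningDataModule):
    def __init__(self, dc_dataset, batch_size: int = 32):
        super().__init__()
        self.dc_dataset = dc_dataset
        self.batch_size = batch_size

    def setup(self, stage=None):
        # Pull feature/label arrays out of the dataset and hand them to a standard
        # TensorDataset; the Trainer adds a DistributedSampler automatically under "ddp".
        X = torch.as_tensor(self.dc_dataset.X, dtype=torch.float32)
        y = torch.as_tensor(self.dc_dataset.y, dtype=torch.float32)
        self.train_set = TensorDataset(X, y)

    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size=self.batch_size)


# Usage sketch (assuming a featurized training dataset and the earlier LitWrapper):
# dm = DeepChemDataModule(train_dataset)
# trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp")
# trainer.fit(LitWrapper(PlainTorchNet()), datamodule=dm)
```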

