
How to use l2l.algorithms.MAML correctly with nn.DistributedDataParallel?

This work is awesome!

Using nn.DistributedDataParallel in the following way raises an error when executing learner = maml.clone(). How do I use it correctly? Should I wrap MyModel in nn.DistributedDataParallel first and then pass it to MAML? Thanks!

model = MyModel()
maml = l2l.algorithms.MAML(model, lr=0.5)
model = nn.DistributedDataParallel(model, device_ids=[rank])
learner = maml.clone()

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 13 (5 by maintainers)

Top GitHub Comments

3 reactions
seba-1511 commented, Aug 15, 2020

Hello @AyanamiReiFan, and thanks for the kind words.

Parallelizing MAML with DistributedDataParallel is a bit tricky as the implementation relies on gradient hooks which don’t play well with clone/grad. If you want to use torch.distributed to parallelize the training loop, cherry’s Distributed optimizer is another option:

from torch import optim
from cherry.optim import Distributed

opt = optim.Adam(model.parameters())
opt = Distributed(model.parameters(), opt, sync=1)  # averages gradients across processes every `sync` steps
# Training code
opt.step()
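
For context, here is a minimal sketch (not from the original thread) of how the Distributed wrapper could fit into a meta-training loop driven by torch.distributed. MyModel, make_task_batch, loss_fn, and num_iterations are hypothetical placeholders, and the process-group setup assumes each worker is launched with the usual environment-variable rendezvous (e.g. via torchrun).

import torch
import torch.distributed as dist
import learn2learn as l2l
from cherry.optim import Distributed

dist.init_process_group(backend="gloo")  # assumes MASTER_ADDR/RANK/WORLD_SIZE are set by the launcher

model = MyModel()                                   # hypothetical user-defined nn.Module
maml = l2l.algorithms.MAML(model, lr=0.5)
adam = torch.optim.Adam(maml.parameters())
opt = Distributed(maml.parameters(), adam, sync=1)  # averages gradients across workers at each step

for iteration in range(num_iterations):             # num_iterations: placeholder
    adam.zero_grad()
    learner = maml.clone()                          # per-task clone, as in the snippet above
    support_x, support_y, query_x, query_y = make_task_batch()  # placeholder task sampler
    learner.adapt(loss_fn(learner(support_x), support_y))       # inner-loop adaptation
    loss_fn(learner(query_x), query_y).backward()                # meta-gradients accumulate in maml's parameters
    opt.step()                                      # gradients are synchronized, then Adam applies the update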

If you want to parallelize the model over GPUs, I would use torch.nn.DataParallel:

import torch

learner = maml.clone()
learner = torch.nn.DataParallel(learner, device_ids=[0, 1])  # replicates the clone across GPUs 0 and 1 on each forward pass
# Training code
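
As a rough usage sketch (again not from the thread), the wrapped clone can then be driven like any other module. features and labels are placeholder tensors, the MSE loss is arbitrary, and reaching the clone through .module relies on torch.nn.DataParallel storing the wrapped module under that attribute; whether higher-order gradients propagate cleanly through DataParallel's replication is worth verifying for your setup.

import torch
import torch.nn.functional as F

learner = maml.clone()
learner = torch.nn.DataParallel(learner, device_ids=[0, 1])

preds = learner(features)               # features: placeholder batch on cuda:0; the forward pass is split across GPUs
adapt_loss = F.mse_loss(preds, labels)  # labels: placeholder targets; any task loss works here
learner.module.adapt(adapt_loss)        # .module is the underlying MAML clone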

Let me know if you ever find a solution for DistributedDataParallel; I'd be curious to see it.

1 reaction
janbolle commented, Sep 5, 2020

@Kulbear

  1. I did not use a GPU version, as the batch sizes and networks are relatively small and I suspect GPUs would not speed things up much in this setting; you could easily add GPU support, though, since Ray also supports GPUs.
  2. I only ran experiments for regression and classification, and the implementation is done in TF 2.0. Would it be helpful for you?

Top Results From Across the Web

  • Does it support the model wrapped by DistributedDataParallel?
    Correct, there are 2 ways to go about distributed MAML: either use DDP + LightningMAML, or use torch.distributed + cherry. Choose the one...
  • learn2learn.algorithms
    MAML (BaseLearner): High-level implementation of Model-Agnostic Meta-Learning. This class wraps an arbitrary nn.Module and augments it with clone() and adapt ...
  • Distributed data parallel training in Pytorch
    Pytorch provides a tutorial on distributed training using AWS, ... However, it doesn't have code examples of how to use nn.DataParallel.
  • Making Meta-Learning Easily Accessible on PyTorch - Medium
    The goal of a meta-learning algorithm is to use training experience to ... Model Agnostic Meta Learning (MAML) is a popular gradient-based ...
  • How distributed training works in Pytorch - AI Summer
    In this tutorial, we will learn how to use nn.parallel.DistributedDataParallel for training our models in multiple GPUs.
