
Introduce Collaborative Training Strategy!


🚀 Feature

Over the past few months, I’ve been working with a library called hivemind, where the hivemind team has done amazing things such as the Training Transformers Together project. The goal of hivemind is to enable collaborative training across the internet, over many different machines (like a swarm of machines), rather than relying on the specialized machine setups we traditionally see for distributed training.

I have a working (hacky) PoC here; however, I’ve since iterated on it privately, most notably turning it into a Strategy! The CollaborativeStrategy allows machines to connect to each other by passing a list of peers. The Strategy makes the experience with hivemind far easier (handling small things like peers/the DHT, or making changes to the scaler/module where the configuration requires it) and reduces boilerplate in code. I’ve also successfully trained with the CollaborativeStrategy across spot instances, showing that training on unreliable GPUs is possible!
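For context, this is roughly the kind of raw-hivemind boilerplate the strategy hides. A minimal sketch based on hivemind’s quickstart; the run_id and argument values here are illustrative, and exact arguments may differ across hivemind versions:

import torch
import hivemind

model = torch.nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

# Start a DHT node so peers can discover each other
dht = hivemind.DHT(start=True)

# Wrap the local optimizer; gradients are accumulated and averaged
# across peers until the global target batch size is reached
opt = hivemind.Optimizer(
    dht=dht,
    run_id="my_run",          # peers sharing a run_id train together
    optimizer=opt,
    batch_size_per_step=32,   # samples contributed per local step
    target_batch_size=8192,   # global batch size across all peers
)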

The main goal of this ticket is to come to an agreement on whether the CollaborativeStrategy should live in PyTorch Lightning, or as a separate integration (in its own repo).

Motivation

I believe that a new strategy within PyTorch Lightning will bring users who would like to run distributed training on spot instances or otherwise unreliable GPU machines to PyTorch Lightning, and make them aware that it is possible!

Suggested API

The Strategy option makes more sense, as we have to control some of the behaviour of the precision plugin as well as certain pieces of training. More importantly, the hivemind integration currently will not work with any other strategy; making it a Strategy enforces that exclusivity.

import pytorch_lightning as pl
from pytorch_lightning.strategies import CollaborativeStrategy

trainer = pl.Trainer(
    strategy=CollaborativeStrategy(target_batch_size=8192)
)

When users run the code, they are given a message on how other peers can join:

python train.py
# Other peers can connect via:
# "INITIAL_PEERS=<PEERS> python ...
# or pass the peers to the strategy: 
# CollaborativeStrategy(initial_peers='<PEERS>')
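For example, a second machine could join the same run like so (a sketch; <PEERS> stands in for the address string printed by the first machine):

import pytorch_lightning as pl
from pytorch_lightning.strategies import CollaborativeStrategy

# initial_peers comes from the message printed by the first peer
trainer = pl.Trainer(
    strategy=CollaborativeStrategy(
        target_batch_size=8192,
        initial_peers="<PEERS>",
    )
)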

Pros/Cons

Why we should add this to PyTorch Lightning

  • Easier access for users who want to use the CollaborativeStrategy (they just have to install hivemind), hopefully drawing more interested users!
  • No need to maintain an entire separate repo, with its own CI/docs/maintainers
  • Relatively few lines of code (~400), relying on hivemind for the heavy lifting

Why it should exist elsewhere

  • Can exist independently of Lightning. Since no internals are touched, this is just a third-party integration similar to Bagua/DeepSpeed etc. (naturally, it also keeps responsibilities separate)
  • Will increase PyTorch Lightning CI time and potentially make things even more complicated (we already have to install deepspeed/fairscale/bagua, and now hivemind as well?!)

Alternatives

An alternative would be for the strategy to live in hivemind itself. I haven’t spoken to the hivemind engineers about this (they can pitch in below), but it could be viable. My primary concern is that the hivemind repo is already quite complicated from supporting this type of distributed training.

Additional Context

  • The hivemind team has already been assisting in the development of the strategy, and I’m sure they’ll help us maintain it if needed!

Please leave comments and thoughts below, thanks for reading!

cc @borda @awaelchli @rohitgr7 @akihironitta @justusschock @justheuristic @mryab

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 25
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

5 reactions · tchaton commented, Apr 6, 2022

Hey @SeanNaren, I am in favor of adding the CollaborativeStrategy inside the PyTorch Lightning framework.

As stated, I believe this would make the CollaborativeStrategy more discoverable and help boost the adoption of this new technology.

2 reactions · awaelchli commented, Apr 9, 2022

I’m also in favor of adding it. Bring it home, @SeanNaren!

“Will increase PyTorch Lightning CI time and potentially make things even more complicated (we already have to install deepspeed/fairscale/bagua, and now hivemind as well?!)”

By how much do you estimate? The majority of tests should be unit tests and not add any significant time. What kind of integration/benchmark test did you have in mind?
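To make that concrete, a unit test along these lines would skip automatically on CI images without hivemind and add essentially no runtime (a sketch; the attribute checked is an assumption about the strategy’s API):

import pytest

def test_collaborative_strategy_target_batch_size():
    pytest.importorskip("hivemind")  # skip when hivemind isn't installed
    from pytorch_lightning.strategies import CollaborativeStrategy

    strategy = CollaborativeStrategy(target_batch_size=8192)
    assert strategy.target_batch_size == 8192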
