DeepSpeed support for ignite.distributed


🚀 Feature

PyTorch Lightning recently added native support for Microsoft DeepSpeed.

I believe it would also be helpful for users if Ignite incorporated the DeepSpeed pipeline for memory-efficient distributed training.

1. For idist.auto_model …?

To initialize the DeepSpeed engine:

import deepspeed

# Returns (engine, optimizer, dataloader, lr_scheduler); the last two are unused here.
model_engine, optimizer, _, _ = deepspeed.initialize(args=cmd_args,
                                                     model=model,
                                                     model_parameters=params)
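
For illustration, a DeepSpeed-aware auto_model could delegate to that call instead of wrapping the model in DistributedDataParallel. This is only a hypothetical sketch of the requested behavior, not an existing Ignite API; the function name auto_model_deepspeed and its deepspeed_args parameter are made up for the example:

import deepspeed

def auto_model_deepspeed(model, deepspeed_args):
    # Hypothetical: hand the model to DeepSpeed instead of DDP and return
    # both the engine and the optimizer DeepSpeed builds from its config.
    model_engine, optimizer, _, _ = deepspeed.initialize(
        args=deepspeed_args,
        model=model,
        model_parameters=model.parameters(),
    )
    return model_engine, optimizer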

And for distributed environment setup, we need to replace torch.distributed.init_process_group(...) with deepspeed.init_distributed().
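
Concretely, the swap looks like this (deepspeed.init_distributed is the public DeepSpeed call; the nccl backend shown here is just an example, and it is also the default):

import deepspeed

# Replaces torch.distributed.init_process_group(backend="nccl", ...);
# DeepSpeed picks up the usual RANK/WORLD_SIZE environment variables itself.
deepspeed.init_distributed(dist_backend="nccl")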

2. Checkpoint handler

Checkpointing also works slightly differently; saving goes through the engine:

# client_state carries arbitrary user state (e.g. step counters) with the checkpoint
model_engine.save_checkpoint(args.save_dir, ckpt_id, client_state=client_sd)
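
For completeness, the matching restore call from the DeepSpeed Getting Started guide returns the checkpoint path and the client state dict saved above:

# Restores engine/optimizer state; client_sd is the user dict saved above.
load_path, client_sd = model_engine.load_checkpoint(args.load_dir, ckpt_id)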

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 2
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

2 reactions
vfdev-5 commented, May 21, 2021

@Kashu7100 thanks for the feature request!

Yes, we plan to improve our support of the DeepSpeed framework, which roughly covers:

  • command-line launcher + config file
  • model_engine wrapper
  • various modern optimizers
  • pipeline parallelism
  • AMP using nvidia/apex
  • customized distributed setup (Azure support) on top of torch.distributed

Our idea was to provide basic integration examples of how to use Ignite and DeepSpeed together. I looked at it multiple times, and due to a certain overlap between the frameworks it was not obvious where to draw the line.

@sdesrozis I’m not sure whether we should add it as a new backend or not. Let’s first create a basic integration example and see which parts of the DeepSpeed code could be simplified using idist.

1 reaction
sdesrozis commented, May 21, 2021

@Kashu7100 Finally, introducing a new backend does not seem to be a good option. Have a look here, and you will see that native PyTorch distributed is used when the distributed environment variables are set.

That is good news for simple use cases.

@sdesrozis Do you think it is possible to reuse the idist.Parallel pipeline without modifications?

I would say yes.
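
A minimal sketch of what that reuse could look like, assuming the DeepSpeed engine is created inside the function passed to idist.Parallel. This is not an official integration: the ds_config key, the placeholder model, and passing a JSON config path via the config= keyword (available in recent DeepSpeed versions) are assumptions for the example:

import deepspeed
import ignite.distributed as idist
import torch.nn as nn

def training(local_rank, config):
    model = nn.Linear(10, 2)  # placeholder model
    # idist.Parallel has already spawned/attached the processes; DeepSpeed
    # builds the engine and the optimizer from its JSON config file.
    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=config["ds_config"],
    )
    # The usual ignite Engine loop would go here, calling
    # model_engine.backward(loss) and model_engine.step() in the update function.

if __name__ == "__main__":
    with idist.Parallel(backend="nccl") as parallel:
        parallel.run(training, {"ds_config": "ds_config.json"})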


