
`ORTTrainer` doesn't work with distributed training and/or DeepSpeed

See original GitHub issue

`ORTTrainer.train` fails with distributed training and/or DeepSpeed.

The line `inference_manager = model._torch_module._execution_manager._inference_manager` assumes that `model` is of type `ORTModule`. However, when DeepSpeed is enabled, it is of type `DeepSpeedEngine`, and during distributed training it is of type `DistributedDataParallel`.

This leads to an `AttributeError` in both cases.
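
For illustration, here is a minimal sketch of one way around the failure, relying on the fact that both `DistributedDataParallel` and `DeepSpeedEngine` expose the wrapped model as `.module`. The `unwrap_to_ort_module` helper is hypothetical and not part of optimum:

```python
from torch.nn.parallel import DistributedDataParallel
from onnxruntime.training.ortmodule import ORTModule

def unwrap_to_ort_module(model):
    """Hypothetical helper (not part of optimum): peel off the distributed
    wrappers so the ORTModule internals can be reached safely."""
    if isinstance(model, DistributedDataParallel):
        model = model.module  # DDP exposes the wrapped model as .module
    if type(model).__name__ == "DeepSpeedEngine":
        model = model.module  # DeepSpeedEngine does the same
    if not isinstance(model, ORTModule):
        raise TypeError(f"Expected an ORTModule, got {type(model).__name__}")
    return model

# The failing line would then become:
# inference_manager = unwrap_to_ort_module(model)._torch_module._execution_manager._inference_manager
```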

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
jambayk commented on May 24, 2022

Hi @JingyaHuang, thanks for your work on this!

0 reactions
JingyaHuang commented on Jun 9, 2022

Closing this, as DeepSpeed support is enabled with the release of transformers 4.19.3. Feel free to reach out if there are any further questions.

Read more comments on GitHub >
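
For reference, here is a minimal sketch (not from the thread) of how `ORTTrainer` is typically driven with DeepSpeed after that release; `model`, `train_dataset`, and `ds_config.json` are placeholders:

```python
from optimum.onnxruntime import ORTTrainer, ORTTrainingArguments

# model and train_dataset are assumed to be defined elsewhere;
# ds_config.json is a standard DeepSpeed configuration file.
args = ORTTrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
    deepspeed="ds_config.json",  # hands the config to the DeepSpeed integration
)

trainer = ORTTrainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```

The script would then be launched with a distributed launcher such as `deepspeed train.py` or `torchrun --nproc_per_node=<N> train.py`.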

Top Results From Across the Web

DeepSpeed gets stuck when training · Issue #12418 - GitHub
It's very possible that the distributed network gets stuck because of either of these two, as it can't network. DeepSpeed requires a fully …
Read more >
Train 1 trillion+ parameter models - PyTorch Lightning
For both fine-tuning and pre-training, use DeepSpeed Activation Checkpointing or Activation Checkpointing as the throughput degradation is not significant.
Read more >
Increasing the scale and speed of deep learning ... - YouTube
In addition, the team will present deep-dive results on how they were able to obtain the world record for fastest BERT training. DeepSpeed...
Read more >
[Ray Meetup] Ray Train, PyTorch, TorchX, and ... - YouTube
Welcome to our second Ray meetup, where we focus on Ray's native libraries for scaling machine learning workloads. We'll discuss Ray Train, ...
Read more >
Distributed Training with PyTorch on Piz Daint - Session 1
The Piz Daint supercomputer at CSCS provides an ideal platform for supporting intensive deep learning workloads as it comprises thousands of ...
Read more >
