`ORTTrainer` doesn't work with distributed training and/or DeepSpeed
`ORTTrainer.train` fails with distributed training and/or DeepSpeed.
The line `inference_manager = model._torch_module._execution_manager._inference_manager` assumes that `model` is of type `ORTModule`. However, when DeepSpeed is enabled, `model` is of type `DeepSpeedEngine`, and during distributed training it is of type `DistributedDataParallel`. In both cases the attribute access raises an `AttributeError`.
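To illustrate, here is a minimal sketch of the failure and one possible unwrapping workaround; `unwrap_model` is a hypothetical helper written for this example, not optimum's actual fix. Both `DistributedDataParallel` and `DeepSpeedEngine` store the wrapped model in their `.module` attribute.

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel


def unwrap_model(model: nn.Module) -> nn.Module:
    """Hypothetical helper: return the module wrapped by
    DistributedDataParallel or DeepSpeedEngine (both keep it in
    `.module`); otherwise return `model` unchanged."""
    if isinstance(model, DistributedDataParallel):
        return model.module
    if type(model).__name__ == "DeepSpeedEngine":  # avoids a hard deepspeed import
        return model.module
    return model


# Failing pattern in ORTTrainer.train, which assumes `model` is an ORTModule:
#     inference_manager = model._torch_module._execution_manager._inference_manager
# Under DeepSpeed or DDP, `model` is a wrapper without `_torch_module`, so the
# attribute access raises AttributeError. Unwrapping first avoids the crash:
#     ort_module = unwrap_model(model)
#     inference_manager = ort_module._torch_module._execution_manager._inference_manager
```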
Issue Analytics
- State: closed
- Created a year ago
- Comments: 5 (3 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Hi @JingyaHuang, thanks for your work on this!
Closing this, as DeepSpeed support is enabled with the release of transformers 4.19.3. Feel free to reach out if there are any further questions.
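For reference, a minimal sketch of what DeepSpeed-enabled `ORTTrainer` usage could look like after that release, assuming the transformers-style API that `ORTTrainingArguments` inherits; the model, dataset, and `ds_config.json` below are illustrative placeholders, and the exact `ORTTrainer` signature may vary across optimum versions.

```python
from optimum.onnxruntime import ORTTrainer, ORTTrainingArguments
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Tiny in-memory dataset for illustration; use a real tokenized dataset in practice.
train_dataset = [dict(tokenizer("hello world"), labels=0) for _ in range(8)]

args = ORTTrainingArguments(
    output_dir="ort_out",
    per_device_train_batch_size=4,
    deepspeed="ds_config.json",  # user-provided DeepSpeed config; the trainer's
                                 # model is then wrapped in a DeepSpeedEngine
)

trainer = ORTTrainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()  # launch with e.g. `deepspeed train.py` for multi-GPU runs
```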