
`ORTTrainer` doesn't work with distributed training and/or DeepSpeed

See original GitHub issue

`ORTTrainer.train` fails with distributed training and/or DeepSpeed.

The line `inference_manager = model._torch_module._execution_manager._inference_manager` assumes that `model` is of type `ORTModule`. However, when DeepSpeed is enabled, it is of type `DeepSpeedEngine`, and during distributed training it is of type `DistributedDataParallel`.

This leads to an `AttributeError` in both cases.
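
For illustration, here is a minimal sketch of one way around the failure, relying on the fact that both `DistributedDataParallel` and `DeepSpeedEngine` expose the wrapped model as `.module`. The `unwrap_to_ort_module` helper is hypothetical and not part of optimum:

```python
from torch.nn.parallel import DistributedDataParallel
from onnxruntime.training.ortmodule import ORTModule

def unwrap_to_ort_module(model):
    """Hypothetical helper (not part of optimum): peel off the distributed
    wrappers so the ORTModule internals can be reached safely."""
    if isinstance(model, DistributedDataParallel):
        model = model.module  # DDP exposes the wrapped model as .module
    if type(model).__name__ == "DeepSpeedEngine":
        model = model.module  # DeepSpeedEngine does the same
    if not isinstance(model, ORTModule):
        raise TypeError(f"Expected an ORTModule, got {type(model).__name__}")
    return model

# The failing line would then become:
# inference_manager = unwrap_to_ort_module(model)._torch_module._execution_manager._inference_manager
```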

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
jambayk commented on May 24, 2022

Hi @JingyaHuang, thanks for your work on this!

0 reactions
JingyaHuang commented on Jun 9, 2022

Closing this, as DeepSpeed support is enabled with the release of transformers 4.19.3. Feel free to reach out if there are any further questions.

Read more comments on GitHub >
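
For reference, here is a minimal sketch (not from the thread) of how `ORTTrainer` is typically driven with DeepSpeed after that release; `model`, `train_dataset`, and `ds_config.json` are placeholders:

```python
from optimum.onnxruntime import ORTTrainer, ORTTrainingArguments

# model and train_dataset are assumed to be defined elsewhere;
# ds_config.json is a standard DeepSpeed configuration file.
args = ORTTrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
    deepspeed="ds_config.json",  # hands the config to the DeepSpeed integration
)

trainer = ORTTrainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```

The script would then be launched with a distributed launcher such as `deepspeed train.py` or `torchrun --nproc_per_node=<N> train.py`.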

Top Results From Across the Web

DeepSpeed gets stuck when training · Issue #12418 - GitHub
It's very possible that the distributed network gets stuck because of either of these two, as it can't network. DeepSpeed requires a fully …
Read more >
Train 1 trillion+ parameter models - PyTorch Lightning
For both fine-tuning and pre-training, use DeepSpeed Activation Checkpointing or Activation Checkpointing as the throughput degradation is not significant.
Read more >
Increasing the scale and speed of deep learning ... - YouTube
In addition, the team will present deep-dive results on how they were able to obtain the world record for fastest BERT training. DeepSpeed...
Read more >
[Ray Meetup] Ray Train, PyTorch, TorchX, and ... - YouTube
Welcome to our second Ray meetup, where we focus on Ray's native libraries for scaling machine learning workloads. We'll discuss Ray Train, ...
Read more >
Distributed Training with PyTorch on Piz Daint - Session 1
The Piz Daint supercomputer at CSCS provides an ideal platform for supporting intensive deep learning workloads as it comprises thousands of ...
Read more >
