Compatibility between ORTModule and DeepSpeed
Hi folks,
I have recently been working on validating distributed training features with ORTModule. Here are some incompatibilities that I found during my tests (a minimal sketch of the test setup follows the lists below):
[With DeepSpeed]
- ZeRO Stage 1 and 2 work well
- ZeRO Stage 3 ❌
Warnings:
/usr/local/lib/python3.8/dist-packages/onnxruntime/training/ortmodule/_io.py:558: UserWarning: This model cannot be deep copied (or pickled), which is a required step for stateful models to be properly exported to ONNX. Compute will continue, but unexpected results may occur!
  warnings.warn("This model cannot be deep copied (or pickled)
- BF16 ❌
Error Message:
RuntimeError: /onnxruntime_src/orttraining/orttraining/python/orttraining_pybind_state.cc:752
onnxruntime::python::addObjectMethodsForTraining(pybind11::module&, onnxruntime::python::ExecutionProviderRegistrationFn)::<lambda(onnxruntime::training::OrtModuleGraphBuilder*,
const pybind11::bytes&, const onnxruntime::training::OrtModuleGraphBuilderConfiguration&)>
[ONNXRuntimeError] : 10 : INVALID_GRAPH : This is an invalid model. Type Error: Type 'tensor(bfloat16)' of input parameter
(_original_module.distilbert.embeddings.word_embeddings.weight) of operator (ATen) in node (ATen_17) is invalid
[With Fairscale]
- Can only shard optimizer state
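For reference, here is a minimal sketch of how such a test is typically set up; the model name (distilbert-base-uncased) and the contents of ds_config are my assumptions, not the exact repro:

import deepspeed
from torch_ort import ORTModule
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
model = ORTModule(model)  # wrap with ORTModule before handing the model to DeepSpeed

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "optimizer": {"type": "AdamW", "params": {"lr": 5e-5}},
    "zero_optimization": {"stage": 2},  # stages 1 and 2 work; stage 3 triggers the warning above
    "fp16": {"enabled": True},          # switching to bf16 hits the ATen error above
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)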
Environment
- OS: Ubuntu 20.04
- CUDA/cuDNN version: 11.3/8
- onnxruntime-training: 1.11.1+cu113
- torch: 1.11.0+cu113
- torch-ort: 1.11.1
- Python version: 3.8
- GPU: A100
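A quick way to confirm the installed versions match the list above (a sketch; the expected values in the comments mirror this environment):

import torch
import onnxruntime

print(torch.__version__)        # expected: 1.11.0+cu113
print(onnxruntime.__version__)  # expected: 1.11.1 (from onnxruntime-training)
print(torch.version.cuda)       # expected: 11.3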
I would like to confirm whether these behaviors are intended. And concerning compatibility with DeepSpeed ZeRO Stage 3 and BF16, could you share any insights on whether they will be supported in the future?
Thanks a lot!
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Hi @JingyaHuang,

ORTModule runs the PyTorch model once before exporting it to ONNX. Because of this requirement, it tries to make a deep copy of the original model and execute that copy (so as not to disturb the state of the original model while the export happens). Since this model is not deep-copyable, we issue the warning to indicate that the model being exported might undergo some state change due to a single execution before export. In most cases, this should be a non-issue. If you encounter a problem, please reach out to us.

Looking at the source code, it seems we have not yet added bfloat16 support for executing an ATen op. I believe we have plans to add that soon. Let me circle back on this and provide more details.

Thanks for opening this issue.
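To make the deep-copy requirement above concrete, here is a hypothetical module (not from the issue) that fails copy.deepcopy because it holds a non-picklable handle; this is the kind of state that triggers the ORTModule warning:

import copy
import threading

import torch

class StatefulModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)
        self._lock = threading.Lock()  # locks cannot be pickled or deep-copied

    def forward(self, x):
        with self._lock:
            return self.linear(x)

try:
    copy.deepcopy(StatefulModel())  # mirrors the copy ORTModule attempts before export
except TypeError as err:
    print(f"deepcopy failed: {err}")  # -> cannot pickle '_thread.lock' object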
Hi @baijumeswani,

Thanks for adding BF16 support for the ATen operator! I just tested it with onnxruntime-training 1.12.0.dev20220523001+cu113 (opset=15). This time ATen is fine, but I came up with another error as follows:

If I am not mistaken, although Pow with BF16 is supported in ONNX 1.11.0, it (and maybe other operators essential for transformers) is not registered in ONNX Runtime for BF16, which leads to the training failure. Is that correctly understood?

Besides, one thing I cannot understand well: training with ORTModule with BF16 enabled works well, whereas it does not work when DeepSpeed is enabled (neither stage 1 nor stage 2). Could you explain a little bit more about this?

Thanks!
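One way to check the hypothesis above is to probe ONNX Runtime's kernel registry directly: build a one-node Pow graph with bfloat16 inputs and try to create a session. Kernel resolution happens at session creation, so a missing bfloat16 kernel fails fast. A sketch, assuming opset 15 and the CUDA execution provider:

import onnx
from onnx import TensorProto, helper
import onnxruntime as ort

node = helper.make_node("Pow", ["x", "y"], ["z"])
graph = helper.make_graph(
    [node],
    "pow_bf16_probe",
    [helper.make_tensor_value_info("x", TensorProto.BFLOAT16, [2]),
     helper.make_tensor_value_info("y", TensorProto.BFLOAT16, [2])],
    [helper.make_tensor_value_info("z", TensorProto.BFLOAT16, [2])],
)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 15)])

try:
    ort.InferenceSession(model.SerializeToString(),
                         providers=["CUDAExecutionProvider"])
    print("Pow(bfloat16) kernel found")
except Exception as err:
    print(f"Pow(bfloat16) not registered: {err}")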