Add possibility to output attentions from the ONNX models
Feature request
Hello!
I am interested in the possibility of retrieving not only the logits from the ONNX exported model, but also the attentions (which can be set via output_attentions=True in transformers). I have tried setting output_attentions=True in the config.json file of the model repository, but an exception is raised when the onnxruntime InferenceSession is initialized.
Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from ort_model_optimized.onnx failed:/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/pip-req-build-u762x_rp/onnxruntime/core/graph/graph.cc:1236 void onnxruntime::Graph::InitializeStateFromModelFileGraphProto() This is an invalid model. Graph output (attentions) does not exist in the graph
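For reference, a minimal sketch (not the built-in transformers.onnx exporter, and separate from the config.json approach above) of how the per-layer attention tensors could be declared as graph outputs when exporting with torch.onnx.export; the model name, file name, and output names below are illustrative:

```python
# Sketch: export a BERT-like classifier so each layer's attention tensor
# becomes an ONNX graph output alongside the logits.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "bert-base-uncased"  # illustrative checkpoint
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, output_attentions=True, return_dict=False
).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id)
dummy = tokenizer("dummy input", return_tensors="pt")

# One attention tensor per layer, flattened out of the model's tuple output.
attention_names = [f"attentions.{i}" for i in range(model.config.num_hidden_layers)]

torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "model_with_attentions.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"] + attention_names,  # attentions must be declared here
    dynamic_axes={name: {0: "batch", 1: "sequence"}
                  for name in ("input_ids", "attention_mask")},
    opset_version=13,
)
```

With such an export, an InferenceSession would expose `logits` plus one `attentions.{i}` output per layer via `session.run`.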
Motivation
I am using attention scores to compute importance, but it is currently impossible to extract them in our production environment, as we use ONNX-exported transformers models.
Your contribution
If you could guide me a bit on what to do, I could submit a PR.
Top GitHub Comments
Hello @JingyaHuang, after taking a deeper look it turned out that the attention fusion was causing the problem described above, so using `OptimizationConfig(disable_attention=True)` worked well enough (the attention fusion is skipped, but the remaining fusion optimizations still apply correctly). Thank you for your time and guidance!
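For context, a minimal sketch of the configuration described in this comment, assuming the OptimizationConfig fields available in the optimum version discussed here (field names may differ in newer releases):

```python
from optimum.onnxruntime.configuration import OptimizationConfig

# Keep the other graph fusions but skip the attention fusion, which was
# stripping the attention tensors declared as graph outputs.
optimization_config = OptimizationConfig(
    optimization_level=2,
    disable_attention=True,
)
# The config is then passed to ORTOptimizer; the exact optimizer call
# depends on the installed optimum version.
```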
Hi @jegork, glad to hear that you have found a way to solve it. Setting `optimize_with_onnxruntime_only=True` turns off the whole fusion optimization. Have you tried specifying `FusionOptions`, which can be set within your `OptimizationConfig`, to locate which fusion is leading to the failure? This might help enable some other non-fatal fusions.
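To illustrate the suggestion, a hedged sketch of narrowing down the failing fusion by enabling the disable_* switches one at a time (the flag names are assumptions mirroring onnxruntime's FusionOptions for this optimum version):

```python
from optimum.onnxruntime.configuration import OptimizationConfig

# Candidate fusion switches to test individually (names assumed to mirror
# onnxruntime's FusionOptions in this optimum version).
fusion_flags = [
    "disable_gelu",
    "disable_layer_norm",
    "disable_attention",
    "disable_skip_layer_norm",
    "disable_bias_gelu",
]

for flag in fusion_flags:
    config = OptimizationConfig(optimization_level=2, **{flag: True})
    # Re-run ORTOptimizer and onnxruntime.InferenceSession with `config`;
    # the single flag that makes loading succeed points at the fusion
    # responsible for the failure.
    print(f"Testing optimization with {flag}=True")
```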