Quantization with `transformers.onnx`
Environment info
- `transformers` version: 4.11.3
- Platform: Linux-5.4.0-1059-aws-x86_64-with-glibc2.27
- Python version: 3.9.5
- PyTorch version (GPU?): 1.9.1+cu102 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: True
- Using distributed or parallel set-up in script?: False
Who can help
Documentation: @sgugger
Information
Model I am using (XLMRobertaForTokenClassification):
The problem arises when using:
- my own modified scripts
The task I am working on is:
- my own task or dataset (which I cannot share for data security reasons)
Problem
At the moment, we are using the old graph conversion approach (`convert_graph_to_onnx.py`) to export our models to ONNX, and we use the quantized version. Now we would like to switch to the new `transformers.onnx` package, but we are not sure how to use quantization with it (see code example 1 below). The documentation lacks a section on using quantization with the new package. We tried the old quantization method, and it appeared to work: with code example 2 we checked whether the graph contained nodes labelled "quantized" (it did). But when we used the quantized model for inference, the scores dropped massively. My question: is this usage of quantization correct with the new package, or should we wait for an updated version?
Code example 1
```python
from pathlib import Path

from transformers import PreTrainedTokenizer
from transformers.convert_graph_to_onnx import quantize, verify
from transformers.onnx.convert import export, validate_model_outputs
from transformers.onnx.features import FeaturesManager


def onnx_export(
    model_directory: Path,
    model_filepath: Path,
    tokenizer: PreTrainedTokenizer,
    atol: float = 0.0001,
    feature: str = "default",
    opset: int = 12,
    quantize_model: bool = False,
):
    """Export model to ONNX.

    Note
    ----
    Code taken and modified from:
    https://github.com/huggingface/transformers/blob/master/src/transformers/onnx/__main__.py

    Parameters
    ----------
    model_directory : Path
        Path to the model directory.
    model_filepath : Path
        Filepath to save the model to.
    tokenizer : PreTrainedTokenizer
        Pre-trained tokenizer.
    atol : float, optional
        Absolute difference tolerance when validating the model, by default 0.0001.
    feature : str, optional
        Export the model with some additional feature, by default "default".
    opset : int, optional
        ONNX opset to use, by default 12.
    quantize_model : bool, optional
        Quantize the model to run with int8, by default False.

    Raises
    ------
    ValueError
        If parameter 'opset' is not sufficient to export the chosen kind of model.
    """
    if feature not in [
        "default",
        "causal-lm",
        "seq2seq-lm",
        "sequence-classification",
        "token-classification",
        "multiple-choice",
        "question-answering",
    ]:
        feature = "default"

    # Allocate the model
    model = FeaturesManager.get_model_from_feature(feature, model_directory)
    model_kind, model_onnx_config = FeaturesManager.check_supported_model_or_raise(
        model, feature=feature
    )
    onnx_config = model_onnx_config(model.config)

    # Ensure the requested opset is sufficient
    if opset < onnx_config.default_onnx_opset:
        raise ValueError(
            f"Opset {opset} is not sufficient to export {model_kind}. "
            f"At least {onnx_config.default_onnx_opset} is required."
        )

    _, onnx_outputs = export(tokenizer, model, onnx_config, opset, model_filepath)
    validate_model_outputs(onnx_config, tokenizer, model, model_filepath, onnx_outputs, atol)

    if quantize_model:
        quantized_model = quantize(model_filepath)
        verify(quantized_model)
        # Remove the original (float) model
        model_filepath.unlink()
        # Rename the quantized model to the original filename
        quantized_model.rename(str(model_filepath.resolve()))
```
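For context, a minimal invocation of the helper above might look like the sketch below; the model directory, output path and checkpoint are placeholders, and it assumes `onnx_export` is importable from the same module.

```python
from pathlib import Path

from transformers import AutoTokenizer

# Hypothetical paths for illustration only.
model_dir = Path("models/xlm-roberta-ner")
tokenizer = AutoTokenizer.from_pretrained(model_dir)

onnx_export(
    model_directory=model_dir,
    model_filepath=Path("onnx/model.onnx"),
    tokenizer=tokenizer,
    feature="token-classification",
    opset=12,
    quantize_model=True,
)
```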
Code example 2
```python
import onnx

model = onnx.load("model.onnx")
onnx.checker.check_model(model)
print(onnx.helper.printable_graph(model.graph))
```
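As a complement to eyeballing the printed graph, the quantization-related operators can be counted programmatically with the standard `onnx` graph API; this is just a sketch, and the file name is a placeholder.

```python
import onnx

model = onnx.load("model.onnx")

# Dynamic quantization in ONNX Runtime typically introduces operators such as
# DynamicQuantizeLinear, MatMulInteger or QLinearMatMul; count anything that
# looks quantization-related rather than grepping the printed graph.
quant_ops = [
    node.op_type
    for node in model.graph.node
    if "quant" in node.op_type.lower() or "integer" in node.op_type.lower()
]
print(f"{len(quant_ops)} quantization-related nodes found: {sorted(set(quant_ops))}")
```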
Hello! As Lysandre said, optimization features are currently being added to optimum.
That being said, I see one potential reason for the scores dropping: the old graph conversion script has an optimize step, which performs many optimizations on the graph. The resulting graph has a different topology than the one initially converted to ONNX, and quantization is applied to this optimized version. In your code example 1, you are applying quantization directly to the converted ONNX model. One thing you can try is optimizing the converted model (the same way it is done in the old conversion script) and then applying quantization to the optimized version. Not only will the resulting model be faster, it might solve your issue as well.
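A minimal sketch of that suggestion, reusing the `optimize`/`quantize`/`verify` helpers from the old conversion script (the file name is a placeholder, and each helper writes a new file next to its input and returns that path):

```python
from pathlib import Path

from transformers.convert_graph_to_onnx import optimize, quantize, verify

# Model already exported with the new transformers.onnx package (placeholder path).
onnx_path = Path("model.onnx")

# Reproduce the old pipeline's ordering: optimize the exported graph first,
# then quantize the optimized graph rather than the raw export.
optimized_path = optimize(onnx_path)
quantized_path = quantize(optimized_path)

# Sanity-check that both models can be loaded by ONNX Runtime.
verify(optimized_path)
verify(quantized_path)
```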
Hello! The new package does not have a quantization option, as we're moving all performance-optimization features into a separate library whose sole focus is accelerating model performance.
The package is the following: https://github.com/huggingface/optimum
You can find a bit of documentation about the feature here: https://github.com/huggingface/optimum/tree/main/src/optimum/onnxruntime
The docs are currently a work in progress and should improve significantly over the coming weeks/months.
As for the questions regarding the quantization, I will let @michaelbenayoun and @mfuntowicz answer 😃
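For readers landing here later, quantization with optimum's ONNX Runtime integration looks roughly like the sketch below. The API has evolved across releases, so the class and method names used here (`ORTModelForTokenClassification`, `ORTQuantizer`, `AutoQuantizationConfig`) and the checkpoint path are assumptions to be checked against the current optimum documentation.

```python
from optimum.onnxruntime import ORTModelForTokenClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Placeholder checkpoint: export the model to ONNX via optimum, then apply
# dynamic int8 quantization with ONNX Runtime.
model = ORTModelForTokenClassification.from_pretrained(
    "path/to/finetuned-xlm-roberta", export=True
)
quantizer = ORTQuantizer.from_pretrained(model)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="onnx-quantized", quantization_config=qconfig)
```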