Quantizing ONNX text classification model based on "setu4993/smaller-LaBSE" causes much lower precision scores
System Info
- `transformers` version: 4.18.0
- Platform: Linux-5.13.0-39-generic-x86_64-with-glibc2.29
- Python version: 3.8.10
- Huggingface_hub version: 0.5.1
- PyTorch version (GPU?): 1.11.0+cu102 (True)
- Tensorflow version (GPU?): 2.3.0 (False)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
- `optimum` version: 1.1.0
Who can help?
@LysandreJik since LaBSE is based on BERT
Not sure who to ping for optimum.
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
The following code trains a simple binary text classification model in PyTorch, converts it to ONNX, and then quantizes the ONNX model, printing evaluation results for each stage. It first fine-tunes "distilbert-base-uncased" and then "setu4993/smaller-LaBSE". Scores barely change across the three model versions for distilbert, but quantization lowers the LaBSE scores substantially.
```python
import os
import shutil
import numpy as np
import torch
from datasets import load_dataset, Dataset, load_metric
from optimum.onnxruntime.configuration import AutoQuantizationConfig
from optimum.onnxruntime import ORTQuantizer, ORTModel
from sklearn.metrics import precision_recall_fscore_support
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding, TrainingArguments, Trainer

os.environ["WANDB_DISABLED"] = "true"

imdb = load_dataset("imdb").shuffle()
# Taking a subset to speed up training/testing times, same effect occurs on full dataset
imdb['train'] = Dataset.from_dict(imdb['train'][:1000])
imdb['test'] = Dataset.from_dict(imdb['test'][:1000])

metric = load_metric('f1')


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    metrics = metric.compute(predictions=predictions, references=labels)
    prec_rec = precision_recall_fscore_support(labels, predictions, average='binary')
    metrics['precision'] = prec_rec[0]
    metrics['recall'] = prec_rec[1]
    return metrics


def train_eval_demo(model_name):
    """
    Train a simple binary text classification model using the IMDB dataset.
    Convert to ONNX, then quantize the ONNX model, and return evaluation results.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def preprocess_function(examples):
        return tokenizer(examples["text"], truncation=True, max_length=128)

    tokenized_imdb = imdb.map(preprocess_function, batched=True)
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    output_dir = "model_debug"
    if os.path.exists(output_dir):
        shutil.rmtree(output_dir)

    training_args = TrainingArguments(
        output_dir=f"./{output_dir}",
        learning_rate=2e-5,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        num_train_epochs=5,
        weight_decay=0.01,
        overwrite_output_dir=True,
        no_cuda=not torch.cuda.is_available(),
        save_steps=1000,
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_imdb["train"],
        eval_dataset=tokenized_imdb["test"],
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
    )
    trainer.train(resume_from_checkpoint=False)
    trainer.save_model()

    metrics = trainer.evaluate(metric_key_prefix="eval")
    results = {}
    results["1. PyTorch"] = metrics

    # The type of quantization to apply
    qconfig = AutoQuantizationConfig.arm64(is_static=False, per_channel=False)
    # The model we wish to quantize
    quantizer = ORTQuantizer.from_pretrained(output_dir, feature="sequence-classification")

    # Quantize the model!
    quantizer.export(
        onnx_model_path=f"{output_dir}/onnx_model.onnx",
        onnx_quantized_model_output_path=f"{output_dir}/onnx_model-quantized.onnx",
        quantization_config=qconfig,
    )

    ort_model = ORTModel(f"{output_dir}/onnx_model.onnx", quantizer._onnx_config)
    ort_outputs = ort_model.evaluation_loop(tokenized_imdb["test"])
    onnx_metrics = compute_metrics((ort_outputs.predictions, tokenized_imdb["test"]["label"]))
    results["2. ONNX"] = onnx_metrics

    ort_model = ORTModel(f"{output_dir}/onnx_model-quantized.onnx", quantizer._onnx_config)
    ort_outputs = ort_model.evaluation_loop(tokenized_imdb["test"])
    onnx_quant_metrics = compute_metrics((ort_outputs.predictions, tokenized_imdb["test"]["label"]))
    results["3. ONNX Quant"] = onnx_quant_metrics

    return results


# Models to compare
model_names = ["distilbert-base-uncased", "setu4993/smaller-LaBSE"]
model_scores = {}
for model_name in model_names:
    results = train_eval_demo(model_name)
    model_scores[model_name] = results

for model_name in model_names:
    print("* " + model_name)
    for iteration in sorted(model_scores[model_name]):
        print(iteration)
        print(f"\t{model_scores[model_name][iteration]}")
    print()
```
Expected behavior
I'm building text classification models based on LaBSE (specifically the smaller version "setu4993/smaller-LaBSE"). After conversion to ONNX, the F1, precision, and recall values are unchanged. However, after quantization the scores drop sharply (precision falls from 83.2 to 62.7):
```
* setu4993/smaller-LaBSE
1. PyTorch
    {'eval_f1': 0.8576998050682261, 'eval_precision': 0.831758034026465, 'eval_recall': 0.8853118712273642}
2. ONNX
    {'f1': 0.8576998050682261, 'precision': 0.831758034026465, 'recall': 0.8853118712273642}
3. ONNX Quant
    {'f1': 0.7473598700243704, 'precision': 0.6267029972752044, 'recall': 0.9255533199195171}
```
What I would expect is for the scores to change by at most a point or so after quantization. The code above reproduces these scores. The same effect occurs with larger LaBSE models, such as 'pvl/labse_bert', and on token classification tasks.
For comparison, here is the output using distilbert-base-uncased. No dramatic score changes.
```
* distilbert-base-uncased
1. PyTorch
    {'eval_f1': 0.8359683794466404, 'eval_precision': 0.8213592233009709, 'eval_recall': 0.8511066398390342}
2. ONNX
    {'f1': 0.8359683794466404, 'precision': 0.8213592233009709, 'recall': 0.8511066398390342}
3. ONNX Quant
    {'f1': 0.8288822947576657, 'precision': 0.8151750972762646, 'recall': 0.8430583501006036}
```
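To rule out the evaluation harness as the cause, one quick check is to compare the raw logits of the FP32 and quantized ONNX models directly with onnxruntime. Below is a minimal sketch of that check; the file paths follow the script above, and it simply feeds whatever inputs the exported graph declares, so adjust if your export differs.

```python
# Minimal sketch: compare raw logits of the FP32 and INT8 ONNX models on a
# couple of sentences. Paths follow the reproduction script above.
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("model_debug")  # saved by trainer.save_model()
fp32_sess = ort.InferenceSession("model_debug/onnx_model.onnx")
int8_sess = ort.InferenceSession("model_debug/onnx_model-quantized.onnx")

texts = ["A wonderful film, I loved it.", "Dull, predictable and far too long."]
enc = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="np")

# Feed only the inputs that the exported graph actually declares.
feed = {inp.name: enc[inp.name] for inp in fp32_sess.get_inputs() if inp.name in enc}

fp32_logits = fp32_sess.run(None, feed)[0]
int8_logits = int8_sess.run(None, feed)[0]
print("max abs logit difference:", np.abs(fp32_logits - int8_logits).max())
```

If the logits already diverge noticeably on a handful of sentences, the degradation is introduced by the quantized graph itself rather than by the evaluation code.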
Top GitHub Comments
Investigating this deeper, I found the problem was due to a small set of MatMul nodes, which I identified by excluding them one by one and checking the resulting effect on scores.
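For anyone who wants to reproduce that bisection, here is a rough sketch of the exclude-one-node-at-a-time loop. It assumes the QuantizationConfig used by this optimum version carries a `nodes_to_exclude` list that is forwarded to the ONNX Runtime quantizer (that field name is an assumption; check your release), and it reads the MatMul node names from the exported FP32 graph with the onnx package.

```python
# Rough sketch of the MatMul bisection described above.
# Assumption: this optimum version's QuantizationConfig exposes a
# `nodes_to_exclude` list that reaches the ONNX Runtime quantizer.
import onnx
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

output_dir = "model_debug"
fp32_path = f"{output_dir}/onnx_model.onnx"

# Names of all MatMul nodes in the exported FP32 graph.
graph = onnx.load(fp32_path).graph
matmul_nodes = [n.name for n in graph.node if n.op_type == "MatMul" and n.name]

for node_name in matmul_nodes:
    qconfig = AutoQuantizationConfig.arm64(is_static=False, per_channel=False)
    qconfig.nodes_to_exclude = [node_name]  # leave this single node in FP32

    quantizer = ORTQuantizer.from_pretrained(output_dir, feature="sequence-classification")
    safe_name = node_name.replace("/", "_")
    quantizer.export(
        onnx_model_path=fp32_path,
        onnx_quantized_model_output_path=f"{output_dir}/quantized-without-{safe_name}.onnx",
        quantization_config=qconfig,
    )
    # ...then evaluate each quantized variant with ORTModel.evaluation_loop()
    # as in the reproduction script and compare the scores.
```

Whichever exclusions recover the FP32-level scores point at the offending nodes.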
Moving that issue to optimum!