
Quantizing ONNX text classification model based on "setu4993/smaller-LaBSE" causes much lower precision scores


System Info

- `transformers` version: 4.18.0
- Platform: Linux-5.13.0-39-generic-x86_64-with-glibc2.29
- Python version: 3.8.10
- Huggingface_hub version: 0.5.1
- PyTorch version (GPU?): 1.11.0+cu102 (True)
- Tensorflow version (GPU?): 2.3.0 (False)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
 
optimum==1.1.0

Who can help?

@LysandreJik since LaBSE is based on BERT

Not sure who to ping for optimum.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

The following code trains a simple binary text classification model using PyTorch, then converts it to ONNX, and finally quantizes the ONNX model, printing evaluation results for each stage. It first fine-tunes “distilbert-base-uncased” and then “setu4993/smaller-LaBSE”. Scores don’t change much across the three model versions for distilbert, but quantization lowers the scores for LaBSE.

import os
import shutil

import numpy as np
import torch
from datasets import load_dataset, Dataset, load_metric
from optimum.onnxruntime.configuration import AutoQuantizationConfig
from optimum.onnxruntime import ORTQuantizer, ORTModel
from sklearn.metrics import precision_recall_fscore_support
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding, TrainingArguments, Trainer


os.environ["WANDB_DISABLED"] = "true"

imdb = load_dataset("imdb").shuffle()

# Taking a subset to speed up training/testing times, same effect occurs on full dataset
imdb['train'] = Dataset.from_dict(imdb['train'][:1000])
imdb['test'] = Dataset.from_dict(imdb['test'][:1000])

metric = load_metric('f1')


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    metrics = metric.compute(predictions=predictions, references=labels)
    prec_rec = precision_recall_fscore_support(labels, predictions, average='binary')
    metrics['precision'] = prec_rec[0]
    metrics['recall'] = prec_rec[1]
    return metrics


def train_eval_demo(model_name):
    """
    Train a simple binary text classification model using the IMDB dataset
    Convert to ONNX and then quantize the ONNX model and return evaluation results
    """

    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def preprocess_function(examples):
        return tokenizer(examples["text"], truncation=True, max_length=128)

    tokenized_imdb = imdb.map(preprocess_function, batched=True)

    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    output_dir = "model_debug"
    if os.path.exists(output_dir):
        shutil.rmtree(output_dir)

    training_args = TrainingArguments(
        output_dir=f"./{output_dir}",
        learning_rate=2e-5,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        num_train_epochs=5,
        weight_decay=0.01,
        overwrite_output_dir=True,
        no_cuda=not torch.cuda.is_available(),
        save_steps=1000,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_imdb["train"],
        eval_dataset=tokenized_imdb["test"],
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
    )

    trainer.train(resume_from_checkpoint=False)
    trainer.save_model()

    metrics = trainer.evaluate(metric_key_prefix="eval")

    results = {}
    results["1. PyTorch"] = metrics

    # Define the type of dynamic quantization to apply
    qconfig = AutoQuantizationConfig.arm64(is_static=False, per_channel=False)
    # Load the fine-tuned model we wish to quantize
    quantizer = ORTQuantizer.from_pretrained(output_dir, feature="sequence-classification")

    # Quantize the model!
    quantizer.export(
        onnx_model_path=f"{output_dir}/onnx_model.onnx",
        onnx_quantized_model_output_path=f"{output_dir}/onnx_model-quantized.onnx",
        quantization_config=qconfig,
    )

    ort_model = ORTModel(f"{output_dir}/onnx_model.onnx", quantizer._onnx_config)
    ort_outputs = ort_model.evaluation_loop(tokenized_imdb["test"])
    onnx_metrics = compute_metrics((ort_outputs.predictions, tokenized_imdb["test"]["label"]))
    results["2. ONNX"] = onnx_metrics

    ort_model = ORTModel(f"{output_dir}/onnx_model-quantized.onnx", quantizer._onnx_config)
    ort_outputs = ort_model.evaluation_loop(tokenized_imdb["test"])
    onnx_quant_metrics = compute_metrics((ort_outputs.predictions, tokenized_imdb["test"]["label"]))
    results["3. ONNX Quant"] = onnx_quant_metrics

    return results


# Models to compare
model_names = ["distilbert-base-uncased", "setu4993/smaller-LaBSE"]
model_scores = {}

for model_name in model_names:
    results = train_eval_demo(model_name)
    model_scores[model_name] = results

for model_name in model_names:
    print("* " + model_name)
    for iteration in sorted(model_scores[model_name]):
        print(iteration)
        print(f"\t{model_scores[model_name][iteration]}")
    print()

Expected behavior

I'm building text classification models based on LaBSE (specifically the smaller version "setu4993/smaller-LaBSE"). After conversion to ONNX, the f1, precision and recall values are unchanged. However, after quantization the scores drop sharply (precision falls from 83.2 to 62.7):

* setu4993/smaller-LaBSE
1. PyTorch
	{'eval_f1': 0.8576998050682261, 'eval_precision': 0.831758034026465, 'eval_recall': 0.8853118712273642}
2. ONNX
	{'f1': 0.8576998050682261, 'precision': 0.831758034026465, 'recall': 0.8853118712273642}
3. ONNX Quant
	{'f1': 0.7473598700243704, 'precision': 0.6267029972752044, 'recall': 0.9255533199195171}

What I would expect is for the scores to change by at most a point or so after quantization. The above code reproduces these scores. The same drop also occurs on larger LaBSE models, such as 'pvl/labse_bert', and on token classification tasks.

For comparison, here is the output using distilbert-base-uncased.  No dramatic score changes.

* distilbert-base-uncased
1. PyTorch
	{'eval_f1': 0.8359683794466404, 'eval_precision': 0.8213592233009709, 'eval_recall': 0.8511066398390342}
2. ONNX
	{'f1': 0.8359683794466404, 'precision': 0.8213592233009709, 'recall': 0.8511066398390342}
3. ONNX Quant
	{'f1': 0.8288822947576657, 'precision': 0.8151750972762646, 'recall': 0.8430583501006036}
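
As a quick diagnostic (not part of the original report), one way to see what dynamic quantization actually rewrote is to compare operator counts between the exported and the quantized graphs using the onnx package. The file paths below match the ones produced by the reproduction script above:

from collections import Counter

import onnx

# Compare operator counts before and after quantization
for path in ("model_debug/onnx_model.onnx", "model_debug/onnx_model-quantized.onnx"):
    graph = onnx.load(path).graph
    op_counts = Counter(node.op_type for node in graph.node)
    print(path)
    for op_type, count in sorted(op_counts.items()):
        print(f"\t{op_type}: {count}")

On a dynamically quantized BERT-style graph, most MatMul nodes are typically replaced by integer variants, which is what makes the per-node exclusion experiment described in the comments below possible.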


Top GitHub Comments

kongyurui commented on Jun 2, 2022

Investigating this further, I found the problem was caused by a small set of MatMul nodes, which I identified by excluding them from quantization one at a time and checking the effect on the scores.
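
A sketch of how such an exclusion could be expressed with the same optimum API used in the reproduction script (not part of the original comment; the availability of nodes_to_exclude on the quantization config and the node names are assumptions — take the real names from the exported graph, e.g. with netron or the onnx package):

from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

qconfig = AutoQuantizationConfig.arm64(is_static=False, per_channel=False)
# Hypothetical node names -- replace with the MatMul nodes identified in the graph
qconfig.nodes_to_exclude = ["MatMul_123", "MatMul_456"]

quantizer = ORTQuantizer.from_pretrained("model_debug", feature="sequence-classification")
quantizer.export(
    onnx_model_path="model_debug/onnx_model.onnx",
    onnx_quantized_model_output_path="model_debug/onnx_model-quantized-partial.onnx",
    quantization_config=qconfig,
)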

LysandreJik commented on Apr 21, 2022

Moving that issue to optimum!


