
Quantizing ONNX text classification model based on "setu4993/smaller-LaBSE" causes much lower precision scores


System Info

- `transformers` version: 4.18.0
- Platform: Linux-5.13.0-39-generic-x86_64-with-glibc2.29
- Python version: 3.8.10
- Huggingface_hub version: 0.5.1
- PyTorch version (GPU?): 1.11.0+cu102 (True)
- Tensorflow version (GPU?): 2.3.0 (False)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
 
optimum==1.1.0

Who can help?

@LysandreJik since LaBSE is based on BERT

Not sure who to ping for optimum.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

The following code trains a simple binary text classification model using PyTorch, then converts it to ONNX, and finally quantizes the ONNX model, printing evaluation results for each stage. It first fine-tunes “distilbert-base-uncased” and then “setu4993/smaller-LaBSE”. Scores don’t change much across the three model versions for distilbert, but quantization lowers the scores for LaBSE.

import os
import shutil

import numpy as np
import torch
from datasets import load_dataset, Dataset, load_metric
from optimum.onnxruntime.configuration import AutoQuantizationConfig
from optimum.onnxruntime import ORTQuantizer, ORTModel
from sklearn.metrics import precision_recall_fscore_support
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding, TrainingArguments, Trainer


os.environ["WANDB_DISABLED"] = "true"

imdb = load_dataset("imdb").shuffle()

# Taking a subset to speed up training/testing times, same effect occurs on full dataset
imdb['train'] = Dataset.from_dict(imdb['train'][:1000])
imdb['test'] = Dataset.from_dict(imdb['test'][:1000])

metric = load_metric('f1')


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    metrics = metric.compute(predictions=predictions, references=labels)
    prec_rec = precision_recall_fscore_support(labels, predictions, average='binary')
    metrics['precision'] = prec_rec[0]
    metrics['recall'] = prec_rec[1]
    return metrics


def train_eval_demo(model_name):
    """
    Train a simple binary text classification model using the IMDB dataset
    Convert to ONNX and then quantize the ONNX model and return evaluation results
    """

    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def preprocess_function(examples):
        return tokenizer(examples["text"], truncation=True, max_length=128)

    tokenized_imdb = imdb.map(preprocess_function, batched=True)

    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    output_dir = "model_debug"
    if os.path.exists(output_dir):
        shutil.rmtree(output_dir)

    training_args = TrainingArguments(
        output_dir=f"./{output_dir}",
        learning_rate=2e-5,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        num_train_epochs=5,
        weight_decay=0.01,
        overwrite_output_dir=True,
        no_cuda=not torch.cuda.is_available(),
        save_steps=1000,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_imdb["train"],
        eval_dataset=tokenized_imdb["test"],
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
    )

    trainer.train(resume_from_checkpoint=False)
    trainer.save_model()

    metrics = trainer.evaluate(metric_key_prefix="eval")

    results = {}
    results["1. PyTorch"] = metrics

    # Define the type of dynamic quantization to apply
    qconfig = AutoQuantizationConfig.arm64(is_static=False, per_channel=False)
    # Load the fine-tuned model we wish to quantize
    quantizer = ORTQuantizer.from_pretrained(output_dir, feature="sequence-classification")

    # Quantize the model!
    quantizer.export(
        onnx_model_path=f"{output_dir}/onnx_model.onnx",
        onnx_quantized_model_output_path=f"{output_dir}/onnx_model-quantized.onnx",
        quantization_config=qconfig,
    )

    ort_model = ORTModel(f"{output_dir}/onnx_model.onnx", quantizer._onnx_config)
    ort_outputs = ort_model.evaluation_loop(tokenized_imdb["test"])
    onnx_metrics = compute_metrics((ort_outputs.predictions, tokenized_imdb["test"]["label"]))
    results["2. ONNX"] = onnx_metrics

    ort_model = ORTModel(f"{output_dir}/onnx_model-quantized.onnx", quantizer._onnx_config)
    ort_outputs = ort_model.evaluation_loop(tokenized_imdb["test"])
    onnx_quant_metrics = compute_metrics((ort_outputs.predictions, tokenized_imdb["test"]["label"]))
    results["3. ONNX Quant"] = onnx_quant_metrics

    return results


# Models to compare
model_names = ["distilbert-base-uncased", "setu4993/smaller-LaBSE"]
model_scores = {}

for model_name in model_names:
    results = train_eval_demo(model_name)
    model_scores[model_name] = results

for model_name in model_names:
    print("* " + model_name)
    for iteration in sorted(model_scores[model_name]):
        print(iteration)
        print(f"\t{model_scores[model_name][iteration]}")
    print()

Expected behavior

I'm building text classification models based on LaBSE (specifically the smaller version "setu4993/smaller-LaBSE"). After conversion to ONNX, the f1, precision and recall values are unchanged. However, after quantization the scores drop sharply (precision falls from 83.2 to 62.7):

* setu4993/smaller-LaBSE
1. PyTorch
	{'eval_f1': 0.8576998050682261, 'eval_precision': 0.831758034026465, 'eval_recall': 0.8853118712273642}
2. ONNX
	{'f1': 0.8576998050682261, 'precision': 0.831758034026465, 'recall': 0.8853118712273642}
3. ONNX Quant
	{'f1': 0.7473598700243704, 'precision': 0.6267029972752044, 'recall': 0.9255533199195171}

What I would expect is for the scores to change by at most a point or so after quantization. The above code reproduces these scores. The same drop also occurs on larger LaBSE models, such as 'pvl/labse_bert', and on token classification tasks.

For comparison, here is the output using distilbert-base-uncased.  No dramatic score changes.

* distilbert-base-uncased
1. PyTorch
	{'eval_f1': 0.8359683794466404, 'eval_precision': 0.8213592233009709, 'eval_recall': 0.8511066398390342}
2. ONNX
	{'f1': 0.8359683794466404, 'precision': 0.8213592233009709, 'recall': 0.8511066398390342}
3. ONNX Quant
	{'f1': 0.8288822947576657, 'precision': 0.8151750972762646, 'recall': 0.8430583501006036}
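
As a quick diagnostic (not part of the original report), one way to see what dynamic quantization actually rewrote is to compare operator counts between the exported and the quantized graphs using the onnx package. The file paths below match the ones produced by the reproduction script above:

from collections import Counter

import onnx

# Compare operator counts before and after quantization
for path in ("model_debug/onnx_model.onnx", "model_debug/onnx_model-quantized.onnx"):
    graph = onnx.load(path).graph
    op_counts = Counter(node.op_type for node in graph.node)
    print(path)
    for op_type, count in sorted(op_counts.items()):
        print(f"\t{op_type}: {count}")

On a dynamically quantized BERT-style graph, most MatMul nodes are typically replaced by integer variants, which is what makes the per-node exclusion experiment described in the comments below possible.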


Top GitHub Comments

kongyurui commented on Jun 2, 2022

Investigating this further, I found the problem was caused by a small set of MatMul nodes, which I identified by excluding them from quantization one at a time and checking the effect on the scores.
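
A sketch of how such an exclusion could be expressed with the same optimum API used in the reproduction script (not part of the original comment; the availability of nodes_to_exclude on the quantization config and the node names are assumptions — take the real names from the exported graph, e.g. with netron or the onnx package):

from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

qconfig = AutoQuantizationConfig.arm64(is_static=False, per_channel=False)
# Hypothetical node names -- replace with the MatMul nodes identified in the graph
qconfig.nodes_to_exclude = ["MatMul_123", "MatMul_456"]

quantizer = ORTQuantizer.from_pretrained("model_debug", feature="sequence-classification")
quantizer.export(
    onnx_model_path="model_debug/onnx_model.onnx",
    onnx_quantized_model_output_path="model_debug/onnx_model-quantized-partial.onnx",
    quantization_config=qconfig,
)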

LysandreJik commented on Apr 21, 2022

Moving that issue to optimum!


