
Problem loading a dynamically quantized DistilBERT model

See original GitHub issue

Hello and thanks for your awesome library,

Environment info

  • transformers version: 3.0.2
  • Platform: Linux-4.15.0-117-generic-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.6.0 (False)
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help

@VictorSanh @stefan-it

Information

I’m trying to optimize a DistilBERT model fine-tuned for token classification (NER) using dynamic quantization. I use this line:

quantized_model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

The model size goes from 540 MB to 411 MB.
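
For context, here is a minimal sketch of that quantization step end to end; the checkpoint path is hypothetical and stands in for wherever the fine-tuned float model was saved:

import torch
from transformers import DistilBertForTokenClassification

# Load the fine-tuned float model (hypothetical local path)
model = DistilBertForTokenClassification.from_pretrained("path/to/finetuned-ner-model")
model.eval()

# Swap every torch.nn.Linear for a dynamically quantized version
# (weights stored as int8; activations are quantized on the fly at runtime)
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)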

The quantized model works fine when I use it straight away in the script to make predictions; however, I’m having trouble saving and reloading it. I tried a few things, first saving with save_pretrained:

quantized_model.save_pretrained(quantized_output_dir)

And then loading it using:

model = AutoModelForTokenClassification.from_pretrained(quantized_output_dir)

When I use it to make predictions, I get the warning: Some weights of the model checkpoint at data/model3/quantized3/ were not used when initializing DistilBertForTokenClassification: ['distilbert.transformer.layer.0.attention.q_lin.scale', 'distilbert.transformer.layer.0.attention.q_lin.zero_point', 'distilbert.transformer.layer.0.attention.q_lin._packed_params.dtype', 'distilbert.transformer.layer.0.attention.q_lin._packed_params._packed_params', 'distilbert.transformer.layer.0.attention.k_lin.scale', 'distilbert.transformer.layer.0.attention.k_lin.zero_point', 'distilbert.transformer.layer.0.attention.k_lin._packed_params.dtype', 'distilbert.transformer.layer.0.attention.k_lin._packed_params._packed_params', 'distilbert.transformer.layer.0.attention.v_lin.scale', 'distilbert.transformer.layer.0.attention.v_lin.zero_point', 'distilbert.transformer.layer.0.attention.v_lin._packed_params.dtype', 'distilbert.transformer.layer.0.attention.v_lin._packed_params._packed_params', 'distilbert.transformer.layer.0.attention.out_lin.scale', 'distilbert.transformer.layer.0.attention.out_lin.zero_point', 'distilbert.transformer.layer.0.attention.out_lin._packed_params.dtype', 'distilbert.transformer.layer.0.attention.out_lin._packed_params._packed_params', 'distilbert.transformer.layer.0.ffn.lin1.scale', 'distilbert.transformer.layer.0.ffn.lin1.zero_point', 'distilbert.transformer.layer.0.ffn.lin1._packed_params.dtype', 'distilbert.transformer.layer.0.ffn.lin1._packed_params._packed_params', 'distilbert.transformer.layer.0.ffn.lin2.scale', 'distilbert.transformer.layer.0.ffn.lin2.zero_point', 'distilbert.transformer.layer.0.ffn.lin2._packed_params.dtype', 'distilbert.transformer.layer.0.ffn.lin2._packed_params._packed_params', 'distilbert.transformer.layer.1.attention.q_lin.scale',

The same warning appears for all the layers (I truncated the list above), and of course I get wrong predictions, because it’s as if the model were never fine-tuned.
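
The warning makes sense when you compare the state_dict of a plain Linear with its dynamically quantized counterpart: from_pretrained rebuilds a float model, so the quantized keys (scale, zero_point, _packed_params) have no matching parameters to fill. A small sketch, with indicative output that may vary slightly across PyTorch versions:

import torch

float_block = torch.nn.Sequential(torch.nn.Linear(768, 768))
quant_block = torch.quantization.quantize_dynamic(
    float_block, {torch.nn.Linear}, dtype=torch.qint8
)

# A float Linear stores plain weight/bias tensors
print(list(float_block.state_dict().keys()))
# ['0.weight', '0.bias']

# Its dynamically quantized counterpart stores packed int8 parameters instead
print(list(quant_block.state_dict().keys()))
# e.g. ['0.scale', '0.zero_point', '0._packed_params.dtype', '0._packed_params._packed_params']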

I tried saving it using:

torch.save(quantized_model.state_dict(), path)

and loading it using:

config = DistilBertConfig.from_pretrained("distilbert-base-multilingual-cased", num_labels=5)
model = DistilBertForTokenClassification.from_pretrained("distilbert-base-multilingual-cased", config=config)
model.load_state_dict(torch.load(path))

and I got this runtime error: RuntimeError: Error(s) in loading state_dict for DistilBertForTokenClassification: Missing key(s) in state_dict: "distilbert.transformer.layer.0.attention.q_lin.weight", "distilbert.transformer.layer.0.attention.q_lin.bias", "distilbert.transformer.layer.0.attention.k_lin.weight", "distilbert.transformer.layer.0.attention.k_lin.bias", "distilbert.transformer.layer.0.attention.v_lin.weight", "distilbert.transformer.layer.0.attention.v_lin.bias", "distilbert.transformer.layer.0.attention.out_lin.weight", "distilbert.transformer.layer.0.attention.out_lin.bias", "distilbert.transformer.layer.0.ffn.lin1.weight", "distilbert.transformer.layer.0.ffn.lin1.bias", "distilbert.transformer.layer.0.ffn.lin2.weight", "distilbert.transformer.layer.0.ffn.lin2.bias", "distilbert.transformer.layer.1.attention.q_lin.weight", Unexpected key(s) in state_dict: "distilbert.transformer.layer.0.attention.q_lin.scale", "distilbert.transformer.layer.0.attention.q_lin.zero_point", "distilbert.transformer.layer.0.attention.q_lin._packed_params.dtype", "distilbert.transformer.layer.0.attention.q_lin._packed_params._packed_params", "distilbert.transformer.layer.0.attention.k_lin.scale", "distilbert.transformer.layer.0.attention.k_lin.zero_point", "distilbert.transformer.layer.0.attention.k_lin._packed_params.dtype", "distilbert.transformer.layer.0.attention.k_lin._packed_params._packed_params", "distilbert.transformer.layer.0.attention.v_lin.scale", "distilbert.transformer.layer.0.attention.v_lin.zero_point", "distilbert.transformer.layer.0.attention.v_lin._packed_params.dtype", "distilbert.transformer.layer.0.attention.v_lin._packed_params._packed_params", "distilbert.transformer.layer.0.attention.out_lin.scale", "distilbert.transformer.layer.0.attention.out_lin.zero_point", "distilbert.transformer.layer.0.attention.out_lin._packed_params.dtype", "distilbert.transformer.layer.0.attention.out_lin._packed_params._packed_params", "distilbert.transformer.layer.0.ffn.lin1.scale", "distilbert.transformer.layer.0.ffn.lin1.zero_point", "distilbert.transformer.layer.0.ffn.lin1._packed_params.dtype", "distilbert.transformer.layer.0.ffn.lin1._packed_params._packed_params", "distilbert.transformer.layer.0.ffn.lin2.scale", "distilbert.transformer.layer.0.ffn.lin2.zero_point", "distilbert.transformer.layer.0.ffn.lin2._packed_params.dtype", "distilbert.transformer.layer.0.ffn.lin2._packed_params._packed_params", "distilbert.transformer.layer.1.attention.q_lin.scale", "classifier._packed_params.dtype", "classifier._packed_params._packed_params". The same pattern repeats for all the layers (I truncated the lists to shorten the text).

Here is the output from printing the quantized model:

DistilBertForTokenClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(119547, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (k_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (v_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (out_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): DynamicQuantizedLinear(in_features=768, out_features=3072, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (lin2): DynamicQuantizedLinear(in_features=3072, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        (1): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (k_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (v_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (out_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): DynamicQuantizedLinear(in_features=768, out_features=3072, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (lin2): DynamicQuantizedLinear(in_features=3072, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        (2): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (k_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (v_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (out_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): DynamicQuantizedLinear(in_features=768, out_features=3072, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (lin2): DynamicQuantizedLinear(in_features=3072, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        (3): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (k_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (v_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (out_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): DynamicQuantizedLinear(in_features=768, out_features=3072, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (lin2): DynamicQuantizedLinear(in_features=3072, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        (4): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (k_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (v_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (out_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): DynamicQuantizedLinear(in_features=768, out_features=3072, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (lin2): DynamicQuantizedLinear(in_features=3072, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        (5): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (k_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (v_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (out_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): DynamicQuantizedLinear(in_features=768, out_features=3072, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (lin2): DynamicQuantizedLinear(in_features=3072, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
      )
    )
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (classifier): DynamicQuantizedLinear(in_features=768, out_features=5, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
)

Expected behavior

The quantized fine-tuned model can be successfully loaded and used to make predictions. Could the “DynamicQuantizedLinear” modules (instead of “Linear”) be causing this problem?

Thanks in advance for your help.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

1 reaction
VictorSanh commented, Nov 11, 2020

It is a matter of adding a few lines:

# Transform your model into a quantized model
quantized_model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
# Load the quantized weights into the quantized model (module in torch)
quantized_model.load_state_dict(torch.load(YOUR_PATH_TO_THE_QUANTIZED_WEIGHTS))
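
Spelled out for this DistilBERT NER setup, that recipe looks roughly like the sketch below; the weights path is hypothetical and should point to the file written by torch.save(quantized_model.state_dict(), path):

import torch
from transformers import DistilBertConfig, DistilBertForTokenClassification

# Rebuild the same float architecture that was fine-tuned
config = DistilBertConfig.from_pretrained("distilbert-base-multilingual-cased", num_labels=5)
model = DistilBertForTokenClassification.from_pretrained("distilbert-base-multilingual-cased", config=config)

# Quantize it the same way as before saving, so the module structure
# (DynamicQuantizedLinear with _packed_params) matches the saved keys
quantized_model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

# Now the quantized state_dict loads without missing/unexpected keys (hypothetical path)
quantized_model.load_state_dict(torch.load("path/to/quantized_state_dict.pt"))
quantized_model.eval()
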
1 reaction
VictorSanh commented, Sep 26, 2020

You are trying to load quantized weights (quantized_model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)) into a non-quantized module (ModelForTokenClassification). You should first make sure that the instance you are loading into is actually a quantized model.
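
One quick way to confirm that, assuming the quantized_model built as in the sketch above, is to inspect a couple of module types before calling load_state_dict:

# These should print a dynamically quantized Linear class, not torch.nn.Linear,
# if the target model is ready to receive the quantized state_dict
print(type(quantized_model.classifier))
print(type(quantized_model.distilbert.transformer.layer[0].attention.q_lin))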

Read more comments on GitHub >

Top Results From Across the Web

Dynamic quantization problems - Optimum
It seems like the model-quantized.onnx is exported without weights… If I load the model.onnx , the accuracy back to normal. Is there something...
Read more >
(beta) Dynamic Quantization on BERT - PyTorch
In this tutorial, we will apply the dynamic quantization on a BERT model, closely following the BERT model from the HuggingFace Transformers examples....
Read more >
(prototype) Graph Mode Dynamic Quantization on BERT
This tutorial introduces the steps to do post training Dynamic Quantization with Graph Mode Quantization. Dynamic quantization converts a float model to a ......
Read more >
(experimental) Dynamic Quantization on BERT - Google Colab
Dynamic quantization support in PyTorch converts a float model to a quantized model with static ... Local MRPC data not specified, downloading data...
Read more >
Static Quantization with Hugging Face `optimum` for
The session will show you how to quantize a DistilBERT model using ... static quantization, compared to dynamic quantization not only ...
Read more >
