
Problem loading a dynamically quantized DistilBERT model

See original GitHub issue

Hello and thanks for your awesome library,

Environment info

  • transformers version: 3.0.2
  • Platform: Linux-4.15.0-117-generic-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.6.0 (False)
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help

@VictorSanh @stefan-it

Information

I’m trying to optimize a DistilBERT model fine-tuned for token classification (NER) using dynamic quantization. I use this line:

quantized_model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

The model size goes from 540 MB to 411 MB.
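
For context, here is a minimal sketch of that quantization step end to end; the checkpoint path is hypothetical and stands in for wherever the fine-tuned float model was saved:

import torch
from transformers import DistilBertForTokenClassification

# Load the fine-tuned float model (hypothetical local path)
model = DistilBertForTokenClassification.from_pretrained("path/to/finetuned-ner-model")
model.eval()

# Swap every torch.nn.Linear for a dynamically quantized version
# (weights stored as int8; activations are quantized on the fly at runtime)
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)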

The quantized model works fine when I use it straight away in the script to make predictions; however, I’m having trouble saving and reloading it. I tried a few things, first saving with save_pretrained:

quantized_model.save_pretrained(quantized_output_dir)

And then loading it using:

model = AutoModelForTokenClassification.from_pretrained(quantized_output_dir)

When I use it to make predictions, I get the warning: Some weights of the model checkpoint at data/model3/quantized3/ were not used when initializing DistilBertForTokenClassification: ['distilbert.transformer.layer.0.attention.q_lin.scale', 'distilbert.transformer.layer.0.attention.q_lin.zero_point', 'distilbert.transformer.layer.0.attention.q_lin._packed_params.dtype', 'distilbert.transformer.layer.0.attention.q_lin._packed_params._packed_params', 'distilbert.transformer.layer.0.attention.k_lin.scale', 'distilbert.transformer.layer.0.attention.k_lin.zero_point', 'distilbert.transformer.layer.0.attention.k_lin._packed_params.dtype', 'distilbert.transformer.layer.0.attention.k_lin._packed_params._packed_params', 'distilbert.transformer.layer.0.attention.v_lin.scale', 'distilbert.transformer.layer.0.attention.v_lin.zero_point', 'distilbert.transformer.layer.0.attention.v_lin._packed_params.dtype', 'distilbert.transformer.layer.0.attention.v_lin._packed_params._packed_params', 'distilbert.transformer.layer.0.attention.out_lin.scale', 'distilbert.transformer.layer.0.attention.out_lin.zero_point', 'distilbert.transformer.layer.0.attention.out_lin._packed_params.dtype', 'distilbert.transformer.layer.0.attention.out_lin._packed_params._packed_params', 'distilbert.transformer.layer.0.ffn.lin1.scale', 'distilbert.transformer.layer.0.ffn.lin1.zero_point', 'distilbert.transformer.layer.0.ffn.lin1._packed_params.dtype', 'distilbert.transformer.layer.0.ffn.lin1._packed_params._packed_params', 'distilbert.transformer.layer.0.ffn.lin2.scale', 'distilbert.transformer.layer.0.ffn.lin2.zero_point', 'distilbert.transformer.layer.0.ffn.lin2._packed_params.dtype', 'distilbert.transformer.layer.0.ffn.lin2._packed_params._packed_params', 'distilbert.transformer.layer.1.attention.q_lin.scale',

The same warning appears for all the layers (I truncated the list above), and of course I get wrong predictions, because it’s as if the model were never fine-tuned.
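
The warning makes sense when you compare the state_dict of a plain Linear with its dynamically quantized counterpart: from_pretrained rebuilds a float model, so the quantized keys (scale, zero_point, _packed_params) have no matching parameters to fill. A small sketch, with indicative output that may vary slightly across PyTorch versions:

import torch

float_block = torch.nn.Sequential(torch.nn.Linear(768, 768))
quant_block = torch.quantization.quantize_dynamic(
    float_block, {torch.nn.Linear}, dtype=torch.qint8
)

# A float Linear stores plain weight/bias tensors
print(list(float_block.state_dict().keys()))
# ['0.weight', '0.bias']

# Its dynamically quantized counterpart stores packed int8 parameters instead
print(list(quant_block.state_dict().keys()))
# e.g. ['0.scale', '0.zero_point', '0._packed_params.dtype', '0._packed_params._packed_params']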

I tried saving it using:

torch.save(quantized_model.state_dict(), path)

and loading it using:

config = DistilBertConfig.from_pretrained("distilbert-base-multilingual-cased", num_labels=5)
model = DistilBertForTokenClassification.from_pretrained("distilbert-base-multilingual-cased", config=config)
model.load_state_dict(torch.load(path))

and I got this runtime error: RuntimeError: Error(s) in loading state_dict for DistilBertForTokenClassification: Missing key(s) in state_dict: "distilbert.transformer.layer.0.attention.q_lin.weight", "distilbert.transformer.layer.0.attention.q_lin.bias", "distilbert.transformer.layer.0.attention.k_lin.weight", "distilbert.transformer.layer.0.attention.k_lin.bias", "distilbert.transformer.layer.0.attention.v_lin.weight", "distilbert.transformer.layer.0.attention.v_lin.bias", "distilbert.transformer.layer.0.attention.out_lin.weight", "distilbert.transformer.layer.0.attention.out_lin.bias", "distilbert.transformer.layer.0.ffn.lin1.weight", "distilbert.transformer.layer.0.ffn.lin1.bias", "distilbert.transformer.layer.0.ffn.lin2.weight", "distilbert.transformer.layer.0.ffn.lin2.bias", "distilbert.transformer.layer.1.attention.q_lin.weight", Unexpected key(s) in state_dict: "distilbert.transformer.layer.0.attention.q_lin.scale", "distilbert.transformer.layer.0.attention.q_lin.zero_point", "distilbert.transformer.layer.0.attention.q_lin._packed_params.dtype", "distilbert.transformer.layer.0.attention.q_lin._packed_params._packed_params", "distilbert.transformer.layer.0.attention.k_lin.scale", "distilbert.transformer.layer.0.attention.k_lin.zero_point", "distilbert.transformer.layer.0.attention.k_lin._packed_params.dtype", "distilbert.transformer.layer.0.attention.k_lin._packed_params._packed_params", "distilbert.transformer.layer.0.attention.v_lin.scale", "distilbert.transformer.layer.0.attention.v_lin.zero_point", "distilbert.transformer.layer.0.attention.v_lin._packed_params.dtype", "distilbert.transformer.layer.0.attention.v_lin._packed_params._packed_params", "distilbert.transformer.layer.0.attention.out_lin.scale", "distilbert.transformer.layer.0.attention.out_lin.zero_point", "distilbert.transformer.layer.0.attention.out_lin._packed_params.dtype", "distilbert.transformer.layer.0.attention.out_lin._packed_params._packed_params", "distilbert.transformer.layer.0.ffn.lin1.scale", "distilbert.transformer.layer.0.ffn.lin1.zero_point", "distilbert.transformer.layer.0.ffn.lin1._packed_params.dtype", "distilbert.transformer.layer.0.ffn.lin1._packed_params._packed_params", "distilbert.transformer.layer.0.ffn.lin2.scale", "distilbert.transformer.layer.0.ffn.lin2.zero_point", "distilbert.transformer.layer.0.ffn.lin2._packed_params.dtype", "distilbert.transformer.layer.0.ffn.lin2._packed_params._packed_params", "distilbert.transformer.layer.1.attention.q_lin.scale", "classifier._packed_params.dtype", "classifier._packed_params._packed_params". The same pattern repeats for all the layers (I truncated the lists to shorten the text).

Here is the output from printing the quantized model:

DistilBertForTokenClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(119547, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (k_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (v_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (out_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): DynamicQuantizedLinear(in_features=768, out_features=3072, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (lin2): DynamicQuantizedLinear(in_features=3072, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        (1): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (k_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (v_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (out_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): DynamicQuantizedLinear(in_features=768, out_features=3072, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (lin2): DynamicQuantizedLinear(in_features=3072, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        (2): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (k_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (v_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (out_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): DynamicQuantizedLinear(in_features=768, out_features=3072, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (lin2): DynamicQuantizedLinear(in_features=3072, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        (3): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (k_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (v_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (out_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): DynamicQuantizedLinear(in_features=768, out_features=3072, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (lin2): DynamicQuantizedLinear(in_features=3072, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        (4): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (k_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (v_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (out_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): DynamicQuantizedLinear(in_features=768, out_features=3072, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (lin2): DynamicQuantizedLinear(in_features=3072, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        (5): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (k_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (v_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (out_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): DynamicQuantizedLinear(in_features=768, out_features=3072, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
            (lin2): DynamicQuantizedLinear(in_features=3072, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
      )
    )
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (classifier): DynamicQuantizedLinear(in_features=768, out_features=5, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
)

Expected behavior

The quantized fine-tuned model can be successfully loaded and used to make predictions. Could the “DynamicQuantizedLinear” modules (instead of “Linear”) be causing this problem?

Thanks in advance for your help.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

1 reaction
VictorSanh commented, Nov 11, 2020

It is a matter of adding a few lines:

# Transform your model into a quantized model
quantized_model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
# Load the quantized weights into the quantized model (module in torch)
quantized_model.load_state_dict(torch.load(YOUR_PATH_TO_THE_QUANTIZED_WEIGHTS))
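
Spelled out for this DistilBERT NER setup, that recipe looks roughly like the sketch below; the weights path is hypothetical and should point to the file written by torch.save(quantized_model.state_dict(), path):

import torch
from transformers import DistilBertConfig, DistilBertForTokenClassification

# Rebuild the same float architecture that was fine-tuned
config = DistilBertConfig.from_pretrained("distilbert-base-multilingual-cased", num_labels=5)
model = DistilBertForTokenClassification.from_pretrained("distilbert-base-multilingual-cased", config=config)

# Quantize it the same way as before saving, so the module structure
# (DynamicQuantizedLinear with _packed_params) matches the saved keys
quantized_model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

# Now the quantized state_dict loads without missing/unexpected keys (hypothetical path)
quantized_model.load_state_dict(torch.load("path/to/quantized_state_dict.pt"))
quantized_model.eval()
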
1 reaction
VictorSanh commented, Sep 26, 2020

You are trying to load quantized weights (quantized_model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)) into a non-quantized module (ModelForTokenClassification). You should first make sure that the instance you are loading into is actually a quantized model.
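
One quick way to confirm that, assuming the quantized_model built as in the sketch above, is to inspect a couple of module types before calling load_state_dict:

# These should print a dynamically quantized Linear class, not torch.nn.Linear,
# if the target model is ready to receive the quantized state_dict
print(type(quantized_model.classifier))
print(type(quantized_model.distilbert.transformer.layer[0].attention.q_lin))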

Read more comments on GitHub >

Top Results From Across the Web

Dynamic quantization problems - Optimum
It seems like the model-quantized.onnx is exported without weights… If I load the model.onnx , the accuracy back to normal. Is there something...
Read more >
(beta) Dynamic Quantization on BERT - PyTorch
In this tutorial, we will apply the dynamic quantization on a BERT model, closely following the BERT model from the HuggingFace Transformers examples....
Read more >
(prototype) Graph Mode Dynamic Quantization on BERT
This tutorial introduces the steps to do post training Dynamic Quantization with Graph Mode Quantization. Dynamic quantization converts a float model to a ......
Read more >
(experimental) Dynamic Quantization on BERT - Google Colab
Dynamic quantization support in PyTorch converts a float model to a quantized model with static ... Local MRPC data not specified, downloading data...
Read more >
Static Quantization with Hugging Face `optimum` for
The session will show you how to quantize a DistilBERT model using ... static quantization, compared to dynamic quantization not only ...
Read more >
