Problem loading a dynamic quantized distilbert model.
Hello, and thanks for your awesome library!
Environment info
- transformers version: 3.0.2
- Platform: Linux-4.15.0-117-generic-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.6.9
- PyTorch version (GPU?): 1.6.0 (False)
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help
Information
I’m trying to optimize a DistilBERT model, fine-tuned for token classification (NER), through dynamic quantization. I use this line:
quantized_model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
The model size goes from 540 MB down to 411 MB.
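For reference, a minimal sketch of one way to compare the on-disk sizes (the state_dict_size_mb helper is purely illustrative; model is the fine-tuned FP32 model that is passed to quantize_dynamic):

```python
import os
import tempfile

import torch


def state_dict_size_mb(m):
    # Serialize the state_dict to a temporary file and report its size in MB.
    with tempfile.NamedTemporaryFile(suffix=".pt") as tmp:
        torch.save(m.state_dict(), tmp.name)
        return os.path.getsize(tmp.name) / 1e6


print("FP32 model:     ", state_dict_size_mb(model), "MB")
print("Quantized model:", state_dict_size_mb(quantized_model), "MB")
```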
The quantized model works fine when I use it straight away in the same script to make predictions; however, I’m having trouble saving and reloading it. I tried a few things, first using save_pretrained:
quantized_model.save_pretrained(quantized_output_dir)
And then loading it using:
model = AutoModelForTokenClassification.from_pretrained(quantized_output_dir)
When I use it to make predictions, I get the warning:
Some weights of the model checkpoint at data/model3/quantized3/ were not used when initializing DistilBertForTokenClassification: ['distilbert.transformer.layer.0.attention.q_lin.scale', 'distilbert.transformer.layer.0.attention.q_lin.zero_point', 'distilbert.transformer.layer.0.attention.q_lin._packed_params.dtype', 'distilbert.transformer.layer.0.attention.q_lin._packed_params._packed_params', 'distilbert.transformer.layer.0.attention.k_lin.scale', 'distilbert.transformer.layer.0.attention.k_lin.zero_point', 'distilbert.transformer.layer.0.attention.k_lin._packed_params.dtype', 'distilbert.transformer.layer.0.attention.k_lin._packed_params._packed_params', 'distilbert.transformer.layer.0.attention.v_lin.scale', 'distilbert.transformer.layer.0.attention.v_lin.zero_point', 'distilbert.transformer.layer.0.attention.v_lin._packed_params.dtype', 'distilbert.transformer.layer.0.attention.v_lin._packed_params._packed_params', 'distilbert.transformer.layer.0.attention.out_lin.scale', 'distilbert.transformer.layer.0.attention.out_lin.zero_point', 'distilbert.transformer.layer.0.attention.out_lin._packed_params.dtype', 'distilbert.transformer.layer.0.attention.out_lin._packed_params._packed_params', 'distilbert.transformer.layer.0.ffn.lin1.scale', 'distilbert.transformer.layer.0.ffn.lin1.zero_point', 'distilbert.transformer.layer.0.ffn.lin1._packed_params.dtype', 'distilbert.transformer.layer.0.ffn.lin1._packed_params._packed_params', 'distilbert.transformer.layer.0.ffn.lin2.scale', 'distilbert.transformer.layer.0.ffn.lin2.zero_point', 'distilbert.transformer.layer.0.ffn.lin2._packed_params.dtype', 'distilbert.transformer.layer.0.ffn.lin2._packed_params._packed_params', 'distilbert.transformer.layer.1.attention.q_lin.scale',
This happens for all the layers, and of course I get wrong predictions, as if the model weren’t fine-tuned.
I tried saving it using:
torch.save(quantized_model.state_dict(), path)
and then loading it using:
config = DistilBertConfig.from_pretrained("distilbert-base-multilingual-cased", num_labels=5)
model = DistilBertForTokenClassification.from_pretrained("distilbert-base-multilingual-cased", config=config)
model.load_state_dict(torch.load(path))
and I got this runtime error:
RuntimeError: Error(s) in loading state_dict for DistilBertForTokenClassification: Missing key(s) in state_dict: "distilbert.transformer.layer.0.attention.q_lin.weight", "distilbert.transformer.layer.0.attention.q_lin.bias", "distilbert.transformer.layer.0.attention.k_lin.weight", "distilbert.transformer.layer.0.attention.k_lin.bias", "distilbert.transformer.layer.0.attention.v_lin.weight", "distilbert.transformer.layer.0.attention.v_lin.bias", "distilbert.transformer.layer.0.attention.out_lin.weight", "distilbert.transformer.layer.0.attention.out_lin.bias", "distilbert.transformer.layer.0.ffn.lin1.weight", "distilbert.transformer.layer.0.ffn.lin1.bias", "distilbert.transformer.layer.0.ffn.lin2.weight", "distilbert.transformer.layer.0.ffn.lin2.bias", "distilbert.transformer.layer.1.attention.q_lin.weight", Unexpected key(s) in state_dict: "distilbert.transformer.layer.0.attention.q_lin.scale", "distilbert.transformer.layer.0.attention.q_lin.zero_point", "distilbert.transformer.layer.0.attention.q_lin._packed_params.dtype", "distilbert.transformer.layer.0.attention.q_lin._packed_params._packed_params", "distilbert.transformer.layer.0.attention.k_lin.scale", "distilbert.transformer.layer.0.attention.k_lin.zero_point", "distilbert.transformer.layer.0.attention.k_lin._packed_params.dtype", "distilbert.transformer.layer.0.attention.k_lin._packed_params._packed_params", "distilbert.transformer.layer.0.attention.v_lin.scale", "distilbert.transformer.layer.0.attention.v_lin.zero_point", "distilbert.transformer.layer.0.attention.v_lin._packed_params.dtype", "distilbert.transformer.layer.0.attention.v_lin._packed_params._packed_params", "distilbert.transformer.layer.0.attention.out_lin.scale", "distilbert.transformer.layer.0.attention.out_lin.zero_point", "distilbert.transformer.layer.0.attention.out_lin._packed_params.dtype", "distilbert.transformer.layer.0.attention.out_lin._packed_params._packed_params", "distilbert.transformer.layer.0.ffn.lin1.scale", "distilbert.transformer.layer.0.ffn.lin1.zero_point", "distilbert.transformer.layer.0.ffn.lin1._packed_params.dtype", "distilbert.transformer.layer.0.ffn.lin1._packed_params._packed_params", "distilbert.transformer.layer.0.ffn.lin2.scale", "distilbert.transformer.layer.0.ffn.lin2.zero_point", "distilbert.transformer.layer.0.ffn.lin2._packed_params.dtype", "distilbert.transformer.layer.0.ffn.lin2._packed_params._packed_params", "distilbert.transformer.layer.1.attention.q_lin.scale", "classifier._packed_params.dtype", "classifier._packed_params._packed_params".
Again for all the layers (truncated to keep the text short).
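A quick way to see the mismatch is to compare the keys stored in the quantized checkpoint with the keys a freshly built FP32 model expects (a sketch reusing path from the snippet above): the weight/bias entries of every Linear layer are replaced by scale, zero_point and _packed_params entries.

```python
import torch
from transformers import DistilBertConfig, DistilBertForTokenClassification

config = DistilBertConfig.from_pretrained("distilbert-base-multilingual-cased", num_labels=5)
fp32_model = DistilBertForTokenClassification.from_pretrained(
    "distilbert-base-multilingual-cased", config=config
)

saved_keys = set(torch.load(path).keys())            # keys in the quantized state_dict
expected_keys = set(fp32_model.state_dict().keys())  # keys of the non-quantized model

print("Only in checkpoint:", sorted(saved_keys - expected_keys)[:4])
print("Only in FP32 model:", sorted(expected_keys - saved_keys)[:4])
```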
Here is the output when printing the quantized model:
DistilBertForTokenClassification(
(distilbert): DistilBertModel(
(embeddings): Embeddings(
(word_embeddings): Embedding(119547, 768, padding_idx=0)
(position_embeddings): Embedding(512, 768)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(transformer): Transformer(
(layer): ModuleList(
(0): TransformerBlock(
(attention): MultiHeadSelfAttention(
(dropout): Dropout(p=0.1, inplace=False)
(q_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
(k_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
(v_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
(out_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
)
(sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(ffn): FFN(
(dropout): Dropout(p=0.1, inplace=False)
(lin1): DynamicQuantizedLinear(in_features=768, out_features=3072, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
(lin2): DynamicQuantizedLinear(in_features=3072, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
)
(output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
)
(1): TransformerBlock(
(attention): MultiHeadSelfAttention(
(dropout): Dropout(p=0.1, inplace=False)
(q_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
(k_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
(v_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
(out_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
)
(sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(ffn): FFN(
(dropout): Dropout(p=0.1, inplace=False)
(lin1): DynamicQuantizedLinear(in_features=768, out_features=3072, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
(lin2): DynamicQuantizedLinear(in_features=3072, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
)
(output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
)
(2): TransformerBlock(
(attention): MultiHeadSelfAttention(
(dropout): Dropout(p=0.1, inplace=False)
(q_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
(k_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
(v_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
(out_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
)
(sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(ffn): FFN(
(dropout): Dropout(p=0.1, inplace=False)
(lin1): DynamicQuantizedLinear(in_features=768, out_features=3072, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
(lin2): DynamicQuantizedLinear(in_features=3072, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
)
(output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
)
(3): TransformerBlock(
(attention): MultiHeadSelfAttention(
(dropout): Dropout(p=0.1, inplace=False)
(q_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
(k_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
(v_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
(out_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
)
(sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(ffn): FFN(
(dropout): Dropout(p=0.1, inplace=False)
(lin1): DynamicQuantizedLinear(in_features=768, out_features=3072, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
(lin2): DynamicQuantizedLinear(in_features=3072, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
)
(output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
)
(4): TransformerBlock(
(attention): MultiHeadSelfAttention(
(dropout): Dropout(p=0.1, inplace=False)
(q_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
(k_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
(v_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
(out_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
)
(sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(ffn): FFN(
(dropout): Dropout(p=0.1, inplace=False)
(lin1): DynamicQuantizedLinear(in_features=768, out_features=3072, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
(lin2): DynamicQuantizedLinear(in_features=3072, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
)
(output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
)
(5): TransformerBlock(
(attention): MultiHeadSelfAttention(
(dropout): Dropout(p=0.1, inplace=False)
(q_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
(k_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
(v_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
(out_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
)
(sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(ffn): FFN(
(dropout): Dropout(p=0.1, inplace=False)
(lin1): DynamicQuantizedLinear(in_features=768, out_features=3072, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
(lin2): DynamicQuantizedLinear(in_features=3072, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
)
(output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
)
)
)
)
(dropout): Dropout(p=0.1, inplace=False)
(classifier): DynamicQuantizedLinear(in_features=768, out_features=5, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
)
Expected behavior
Being able to successfully load the quantized, fine-tuned model and use it to make predictions. Could the “DynamicQuantizedLinear” modules (instead of “Linear”) be causing this problem?
Thanks in advance for your help.
It is a matter of adding a few lines: you are trying to load quantized weights, produced by
quantized_model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
into a non-quantized module (ModelForTokenClassification). You should first make sure that the instance you are loading into is actually a quantized model.
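A minimal sketch of that approach, reusing the path from above: rebuild the architecture, apply the same dynamic quantization to it, and only then load the saved quantized state_dict.

```python
import torch
from transformers import DistilBertConfig, DistilBertForTokenClassification

# Rebuild the same architecture as at fine-tuning time.
config = DistilBertConfig.from_pretrained("distilbert-base-multilingual-cased", num_labels=5)
model = DistilBertForTokenClassification.from_pretrained(
    "distilbert-base-multilingual-cased", config=config
)

# Quantize it exactly as it was quantized before saving, so the module tree
# contains DynamicQuantizedLinear layers with the expected _packed_params keys.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Now the quantized state_dict can be loaded into a matching structure.
quantized_model.load_state_dict(torch.load(path))
quantized_model.eval()
```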