MNLI evaluation on pretrained models

Environment info

  • transformers version: 4.4.dev / 4.3.3 / 4.3.2
  • Platform: Ubuntu 18.04 / Windows 10
  • Python version: 3.6.2
  • PyTorch version (GPU?): 1.7.1 (True)
  • Tensorflow version (GPU?): N/A
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help

@patil-suraj , @sgugger, @LysandreJik

Information

Model I am using (Bert, XLNet …): huggingface/distilbert-base-uncased-finetuned-mnli - microsoft/deberta-v2-xxlarge-mnli - roberta-large-mnli - squeezebert/squeezebert-mnli - BERT-Base-MNLI…

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

I use run_glue.py on fine-tuned models to reproduce the evaluation results (only --do_eval), but the accuracy is about 7%. Other tasks such as MRPC or STS-B work fine with their fine-tuned models.

To reproduce

Steps to reproduce the behavior:

  1. Run the command below, or the same command with any other MNLI fine-tuned model. I even tried a model that I fine-tuned myself using v2.10.0, and that again gives 6%-7% accuracy.
python run_glue.py --model_name_or_path huggingface/distilbert-base-uncased-finetuned-mnli --task_name mnli --do_eval --max_seq_length 128 --output_dir temp/distill
02/24/2021 11:38:34 - WARNING - main - Process rank: -1, device: cuda:0, n_gpu: 1distributed training: False, 16-bits training: False
02/24/2021 11:38:34 - INFO - main - Training/evaluation parameters TrainingArguments(output_dir=temp/distill, overwrite_output_dir=False, do_train=False, do_eval=True, do_predict=False, evaluation_strategy=EvaluationStrategy.NO, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_steps=0, logging_dir=runs\Feb24_11-38-34_Ali_Workstation, logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level=O1, fp16_backend=auto, local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name=temp/distill, disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=False, deepspeed=None, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, report_to=[], ddp_find_unused_parameters=None, dataloader_pin_memory=True, n_gpu=1)
02/24/2021 11:38:36 - WARNING - datasets.builder - Reusing dataset glue (C:\Users\Ali\.cache\huggingface\datasets\glue\mnli\1.0.0\7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4)
[INFO|configuration_utils.py:449] 2021-02-24 11:38:36,777 >> loading configuration file https://huggingface.co/huggingface/distilbert-base-uncased-finetuned-mnli/resolve/main/config.json from cache at C:\Users\Ali/.cache\huggingface\transformers\240bd330b0e7919215436efe944c4073bfcc0bac4b7ed0a3378ab3d1793beb1a.acfb235b208288614b764ad50394132d4751a48a6c81fc382dc669e4d8a80a55
[INFO|configuration_utils.py:485] 2021-02-24 11:38:36,779 >> Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 0,
  "dim": 768,
  "dropout": 0.1,
  "eos_token_ids": 0,
  "finetuning_task": "mnli",
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.3.2",
  "vocab_size": 30522
}
[INFO|configuration_utils.py:449] 2021-02-24 11:38:36,923 >> loading configuration file https://huggingface.co/huggingface/distilbert-base-uncased-finetuned-mnli/resolve/main/config.json from cache at C:\Users\Ali/.cache\huggingface\transformers\240bd330b0e7919215436efe944c4073bfcc0bac4b7ed0a3378ab3d1793beb1a.acfb235b208288614b764ad50394132d4751a48a6c81fc382dc669e4d8a80a55
[INFO|configuration_utils.py:485] 2021-02-24 11:38:36,924 >> Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 0,
  "dim": 768,
  "dropout": 0.1,
  "eos_token_ids": 0,
  "finetuning_task": "mnli",
  "hidden_dim": 3072,
  "id2label": {
    "0": "contradiction",
    "1": "neutral",
    "2": "entailment"
  },
  "initializer_range": 0.02,
  "label2id": {
    "contradiction": "0",
    "entailment": "2",
    "neutral": "1"
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.3.2",
  "vocab_size": 30522
}
[INFO|tokenization_utils_base.py:1688] 2021-02-24 11:38:36,928 >> Model name 'huggingface/distilbert-base-uncased-finetuned-mnli' not found in model shortcut name list (distilbert-base-uncased, distilbert-base-uncased-distilled-squad, distilbert-base-cased, distilbert-base-cased-distilled-squad, distilbert-base-german-cased, distilbert-base-multilingual-cased). Assuming 'huggingface/distilbert-base-uncased-finetuned-mnli' is a path, a model identifier, or url to a directory containing tokenizer files.
[INFO|tokenization_utils_base.py:1786] 2021-02-24 11:38:37,946 >> loading file https://huggingface.co/huggingface/distilbert-base-uncased-finetuned-mnli/resolve/main/vocab.txt from cache at C:\Users\Ali/.cache\huggingface\transformers\3aa49bfb368cde995cea246a5c5ca4d75f769e74b3e6d450776805f998c78366.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99
[INFO|tokenization_utils_base.py:1786] 2021-02-24 11:38:37,947 >> loading file https://huggingface.co/huggingface/distilbert-base-uncased-finetuned-mnli/resolve/main/tokenizer.json from cache at None
[INFO|tokenization_utils_base.py:1786] 2021-02-24 11:38:37,950 >> loading file https://huggingface.co/huggingface/distilbert-base-uncased-finetuned-mnli/resolve/main/added_tokens.json from cache at C:\Users\Ali/.cache\huggingface\transformers\603dca04f5c89cbdcdb8021ec21c4376c7334fa6393347c80a54c942a93e50cb.5cc6e825eb228a7a5cfd27cb4d7151e97a79fb962b31aaf1813aa102e746584b
[INFO|tokenization_utils_base.py:1786] 2021-02-24 11:38:37,951 >> loading file https://huggingface.co/huggingface/distilbert-base-uncased-finetuned-mnli/resolve/main/special_tokens_map.json from cache at C:\Users\Ali/.cache\huggingface\transformers\dea17c39d149e23cb97e2a2829c6170489551d2454352fd18488f17bf90c54db.dd8bd9bfd3664b530ea4e645105f557769387b3da9f79bdb55ed556bdd80611d
[INFO|tokenization_utils_base.py:1786] 2021-02-24 11:38:37,952 >> loading file https://huggingface.co/huggingface/distilbert-base-uncased-finetuned-mnli/resolve/main/tokenizer_config.json from cache at C:\Users\Ali/.cache\huggingface\transformers\ce6fb0f339483f5ca331e9631b13bc5e9c842e64e9a40aa60defb3898b99dbed.11d9edb6b1301b5af13d33c1585ff45ff84dd55cc6915c2872f856d1ee2dc409
[INFO|modeling_utils.py:1027] 2021-02-24 11:38:38,148 >> loading weights file https://huggingface.co/huggingface/distilbert-base-uncased-finetuned-mnli/resolve/main/pytorch_model.bin from cache at C:\Users\Ali/.cache\huggingface\transformers\16516ebd442e5f41cd8caf2de88c478fe8a3a0948e20eaf1fdae0bf2d4998be6.73881288e7255a28dacc8ad53661dde9248c11f6e2d10f3b6db193dddee2a2bc
[INFO|modeling_utils.py:1143] 2021-02-24 11:38:39,218 >> All model checkpoint weights were used when initializing DistilBertForSequenceClassification.
[INFO|modeling_utils.py:1152] 2021-02-24 11:38:39,221 >> All the weights of DistilBertForSequenceClassification were initialized from the model checkpoint at huggingface/distilbert-base-uncased-finetuned-mnli.
If your task is similar to the task the model of the checkpoint was trained on, you can already use DistilBertForSequenceClassification for predictions without further training.
02/24/2021 11:38:39 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at C:\Users\Ali\.cache\huggingface\datasets\glue\mnli\1.0.0\7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4\cache-0a88ac8e6b3bd378.arrow
02/24/2021 11:38:39 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at C:\Users\Ali\.cache\huggingface\datasets\glue\mnli\1.0.0\7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4\cache-e1993e6695981db0.arrow
02/24/2021 11:38:39 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at C:\Users\Ali\.cache\huggingface\datasets\glue\mnli\1.0.0\7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4\cache-133d62ae090971a5.arrow
02/24/2021 11:38:39 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at C:\Users\Ali\.cache\huggingface\datasets\glue\mnli\1.0.0\7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4\cache-497afbfcce3a8a9d.arrow
02/24/2021 11:38:39 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at C:\Users\Ali\.cache\huggingface\datasets\glue\mnli\1.0.0\7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4\cache-7146b31017748988.arrow
02/24/2021 11:38:39 - INFO - main - Sample 335243 of the training set: {‘attention_mask’: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], ‘hypothesis’: “Parents are busy and it’s sometimes hard to get them out.”, ‘idx’: 335243, ‘input_ids’: [101, 2017, 2113, 2043, 2037, 3008, 2272, 1998, 2009, 1005, 1055, 2524, 2000, 2131, 2068, 2041, 1998, 1037, 2843, 1997, 3008, 2031, 3182, 2000, 2175, 1998, 1998, 2477, 2066, 2008, 1998, 2009, 1005, 1055, 2397, 2012, 2305, 2061, 102, 3008, 2024, 5697, 1998, 2009, 1005, 1055, 2823, 2524, 2000, 2131, 2068, 2041, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], ‘label’: 0, ‘premise’: “you know when their parents come and it’s hard to get them out and a lot of parents have places to go and and things like that and it’s late at night so”}.
02/24/2021 11:38:39 - INFO - main - Sample 58369 of the training set: {‘attention_mask’: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], ‘hypothesis’: 'Where and what is art? ', ‘idx’: 58369, ‘input_ids’: [101, 2073, 2003, 2396, 1029, 102, 2073, 1998, 2054, 2003, 2396, 1029, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], ‘label’: 1, ‘premise’: ‘Where is art?’}.
02/24/2021 11:38:39 - INFO - main - Sample 13112 of the training set: {‘attention_mask’: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], ‘hypothesis’: ‘The list says alcohol and injury are negatives facing staff.’, ‘idx’: 13112, ‘input_ids’: [101, 6544, 1998, 4544, 1010, 2004, 2092, 2004, 4766, 19388, 1010, 2024, 2006, 1996, 2862, 1012, 102, 1996, 2862, 2758, 6544, 1998, 4544, 2024, 4997, 2015, 5307, 3095, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], ‘label’: 1, ‘premise’: ‘Alcohol and injury, as well as brief interventions, are on the list.’}.
[INFO|trainer.py:432] 2021-02-24 11:38:41,361 >> The following columns in the training set don’t have a corresponding argument in DistilBertForSequenceClassification.forward and have been ignored: premise, hypothesis, idx.
[INFO|trainer.py:432] 2021-02-24 11:38:41,362 >> The following columns in the evaluation set don’t have a corresponding argument in DistilBertForSequenceClassification.forward and have been ignored: premise, hypothesis, idx.
02/24/2021 11:38:41 - INFO - main - *** Evaluate ***
[INFO|trainer.py:432] 2021-02-24 11:38:41,366 >> The following columns in the evaluation set don’t have a corresponding argument in DistilBertForSequenceClassification.forward and have been ignored: premise, hypothesis, idx.
[INFO|trainer.py:1600] 2021-02-24 11:38:41,371 >> ***** Running Evaluation *****
[INFO|trainer.py:1601] 2021-02-24 11:38:41,371 >> Num examples = 9815
[INFO|trainer.py:1602] 2021-02-24 11:38:41,372 >> Batch size = 8
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1227/1227 [00:10<00:00, 122.19it/s]
02/24/2021 11:38:52 - INFO - main - ***** Eval results mnli *****
02/24/2021 11:38:52 - INFO - main - eval_accuracy = 0.07865511971472236
02/24/2021 11:38:52 - INFO - main - eval_loss = 4.536623954772949
02/24/2021 11:38:52 - INFO - main - eval_runtime = 10.733
02/24/2021 11:38:52 - INFO - main - eval_samples_per_second = 914.471
[INFO|trainer.py:432] 2021-02-24 11:38:52,120 >> The following columns in the evaluation set don’t have a corresponding argument in DistilBertForSequenceClassification.forward and have been ignored: premise, hypothesis, idx.
[INFO|trainer.py:1600] 2021-02-24 11:38:52,124 >> ***** Running Evaluation *****
[INFO|trainer.py:1601] 2021-02-24 11:38:52,124 >> Num examples = 9832
[INFO|trainer.py:1602] 2021-02-24 11:38:52,125 >> Batch size = 8
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1229/1229 [00:10<00:00, 121.59it/s]
02/24/2021 11:39:02 - INFO - main - ***** Eval results mnli-mm *****
02/24/2021 11:39:02 - INFO - main - eval_accuracy = 0.08482506102522376
02/24/2021 11:39:02 - INFO - main - eval_loss = 4.487601280212402
02/24/2021 11:39:02 - INFO - main - eval_runtime = 10.127
02/24/2021 11:39:02 - INFO - main - eval_samples_per_second = 970.87


Expected behavior

All the weights appear to be loaded correctly, but the accuracy is below 10% when it should be above 80%.

[INFO|modeling_utils.py:1143] 2021-02-24 11:38:39,218 >> All model checkpoint weights were used when initializing DistilBertForSequenceClassification.
[INFO|modeling_utils.py:1152] 2021-02-24 11:38:39,221 >> All the weights of DistilBertForSequenceClassification were initialized from the model checkpoint at huggingface/distilbert-base-uncased-finetuned-mnli.
If your task is similar to the task the model of the checkpoint was trained on, you can already use DistilBertForSequenceClassification for predictions without further training.
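
A note on reading these numbers: MNLI has three classes, so even random guessing should land near 33%. An accuracy of 7-8% is well below chance, which points to predictions being systematically assigned to the wrong class ids rather than to badly loaded weights. A minimal diagnostic sketch follows, assuming you have already collected integer predictions and labels as plain lists. The helper names and the toy data are hypothetical and are not part of run_glue.py.

```python
from itertools import permutations


def accuracy(preds, labels):
    """Fraction of positions where prediction and label agree."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def best_label_permutation(preds, labels, num_labels=3):
    """Try every relabeling of the predicted ids.

    Returns (best_accuracy, mapping), where mapping[i] is the label id
    that predicted id i should be translated to.
    Hypothetical helper, intended for diagnosis only.
    """
    best_acc, best_map = -1.0, None
    for perm in permutations(range(num_labels)):
        remapped = [perm[p] for p in preds]
        acc = accuracy(remapped, labels)
        if acc > best_acc:
            best_acc, best_map = acc, perm
    return best_acc, best_map


# Toy data: the "model" is always right, but its ids are rotated by one.
labels = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
preds = [(y + 1) % 3 for y in labels]
preds[-1] = labels[-1]  # one prediction that matches the raw label id

raw_acc = accuracy(preds, labels)
best_acc, mapping = best_label_permutation(preds, labels)
print(f"raw accuracy: {raw_acc:.2f}")
print(f"best accuracy after relabeling: {best_acc:.2f} with mapping {mapping}")
```

If some relabeling lifts the accuracy from far below chance to roughly the expected figure, the model is probably fine and only the id-to-label correspondence between checkpoint and dataset is off.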

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

sgugger commented, Feb 17, 2022

The model was fixed a year ago, in this commit

LysandreJik commented, Feb 24, 2021

Hello! This may be because of labels being switched around for the MNLI task. See this thread https://github.com/huggingface/transformers/pull/10203 for more context.
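
To make the label-order point concrete, here is a minimal sketch of aligning predictions by label name instead of by raw id. The `model_id2label` dict is copied from the second config printed in the log above. The `dataset_label_names` order is an assumption about how the GLUE MNLI dataset orders its classes, so check it against the dataset's own label feature before relying on it. The helper name `build_pred_to_dataset_map` is hypothetical.

```python
def build_pred_to_dataset_map(model_id2label, dataset_label_names):
    """Map each model output id to the dataset id that carries the same label name."""
    name_to_dataset_id = {
        name.lower(): i for i, name in enumerate(dataset_label_names)
    }
    mapping = {}
    for model_id, name in model_id2label.items():
        key = name.lower()
        if key not in name_to_dataset_id:
            raise ValueError(f"label {name!r} is not a dataset label")
        # Config files store ids as strings, so convert them to int.
        mapping[int(model_id)] = name_to_dataset_id[key]
    return mapping


# Taken from the checkpoint config shown in the log.
model_id2label = {"0": "contradiction", "1": "neutral", "2": "entailment"}
# Assumption: class order used by the dataset; verify before use.
dataset_label_names = ["entailment", "neutral", "contradiction"]

pred_to_dataset = build_pred_to_dataset_map(model_id2label, dataset_label_names)

# Translate raw model predictions into dataset label ids before scoring.
model_preds = [0, 2, 1, 0]
aligned = [pred_to_dataset[p] for p in model_preds]
print(pred_to_dataset)
print(aligned)
```

Name-based alignment only helps when the names in the checkpoint config reflect how the checkpoint was actually trained. If the config itself is wrong, the fix has to happen in the config.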
