MNLI evaluation on pretrained models
Environment info
- transformers version: 4.4.dev / 4.3.3 / 4.3.2
- Platform: Ubuntu 18.04 / Windows 10
- Python version: 3.6.2
- PyTorch version (GPU?): 1.7.1 (True)
- Tensorflow version (GPU?): N/A
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
Who can help
@patil-suraj, @sgugger, @LysandreJik

Information
Model I am using (Bert, XLNet …): huggingface/distilbert-base-uncased-finetuned-mnli, microsoft/deberta-v2-xxlarge-mnli, roberta-large-mnli, squeezebert/squeezebert-mnli, BERT-Base-MNLI, …
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The tasks I am working on are:
- an official GLUE/SQUaD task: (give the name)
- my own task or dataset: (give details below)
I use run_glue.py on MNLI fine-tuned models to reproduce their reported evaluation results (with --do_eval only), but the accuracy comes out at roughly 7%. Other tasks such as MRPC or STS-B evaluate fine when I use their fine-tuned models.
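For context, a rough sketch of how the checkpoint's label mapping can be compared against the label ordering used by the datasets GLUE script (using the distilbert checkpoint from above):

```python
# Sketch: compare the checkpoint's label mapping with the datasets ordering.
from datasets import load_dataset
from transformers import AutoConfig

config = AutoConfig.from_pretrained("huggingface/distilbert-base-uncased-finetuned-mnli")
print(config.id2label)  # mapping stored in the checkpoint's config.json

mnli = load_dataset("glue", "mnli", split="validation_matched")
print(mnli.features["label"].names)  # ordering the GLUE script evaluates against
```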
To reproduce
Steps to reproduce the behavior:
- Run
python run_glue.py --model_name_or_path huggingface/distilbert-base-uncased-finetuned-mnli --task_name mnli --do_eval --max_seq_length 128 --output_dir temp/distill
or any other MNLI fine-tuned model. I even tried a model that I fine-tuned myself with v2.10.0, and that again results in 6-7% accuracy.
python run_glue.py --model_name_or_path huggingface/distilbert-base-uncased-finetuned-mnli --task_name mnli --do_eval --max_seq_length 128 --output_dir temp/distill
02/24/2021 11:38:34 - WARNING - main - Process rank: -1, device: cuda:0, n_gpu: 1distributed training: False, 16-bits training: False
02/24/2021 11:38:34 - INFO - main - Training/evaluation parameters TrainingArguments(output_dir=temp/distill, overwrite_output_dir=False, do_train=False, do_eval=True, do_predict=False, evaluation_strategy=EvaluationStrategy.NO, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_steps=0, logging_dir=runs\Feb24_11-38-34_Ali_Workstation, logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level=O1, fp16_backend=auto, local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name=temp/distill, disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=False, deepspeed=None, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, report_to=[], ddp_find_unused_parameters=None, dataloader_pin_memory=True, n_gpu=1)
02/24/2021 11:38:36 - WARNING - datasets.builder - Reusing dataset glue (C:\Users\Ali.cache\huggingface\datasets\glue\mnli\1.0.0\7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4)
[INFO|configuration_utils.py:449] 2021-02-24 11:38:36,777 >> loading configuration file https://huggingface.co/huggingface/distilbert-base-uncased-finetuned-mnli/resolve/main/config.json from cache at C:\Users\Ali/.cache\huggingface\transformers\240bd330b0e7919215436efe944c4073bfcc0bac4b7ed0a3378ab3d1793beb1a.acfb235b208288614b764ad50394132d4751a48a6c81fc382dc669e4d8a80a55
[INFO|configuration_utils.py:485] 2021-02-24 11:38:36,779 >> Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 0,
  "dim": 768,
  "dropout": 0.1,
  "eos_token_ids": 0,
  "finetuning_task": "mnli",
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.3.2",
  "vocab_size": 30522
}
[INFO|configuration_utils.py:449] 2021-02-24 11:38:36,923 >> loading configuration file https://huggingface.co/huggingface/distilbert-base-uncased-finetuned-mnli/resolve/main/config.json from cache at C:\Users\Ali/.cache\huggingface\transformers\240bd330b0e7919215436efe944c4073bfcc0bac4b7ed0a3378ab3d1793beb1a.acfb235b208288614b764ad50394132d4751a48a6c81fc382dc669e4d8a80a55
[INFO|configuration_utils.py:485] 2021-02-24 11:38:36,924 >> Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 0,
  "dim": 768,
  "dropout": 0.1,
  "eos_token_ids": 0,
  "finetuning_task": "mnli",
  "hidden_dim": 3072,
  "id2label": {
    "0": "contradiction",
    "1": "neutral",
    "2": "entailment"
  },
  "initializer_range": 0.02,
  "label2id": {
    "contradiction": "0",
    "entailment": "2",
    "neutral": "1"
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.3.2",
  "vocab_size": 30522
}
[INFO|tokenization_utils_base.py:1688] 2021-02-24 11:38:36,928 >> Model name 'huggingface/distilbert-base-uncased-finetuned-mnli' not found in model shortcut name list (distilbert-base-uncased, distilbert-base-uncased-distilled-squad, distilbert-base-cased, distilbert-base-cased-distilled-squad, distilbert-base-german-cased, distilbert-base-multilingual-cased). Assuming 'huggingface/distilbert-base-uncased-finetuned-mnli' is a path, a model identifier, or url to a directory containing tokenizer files.
[INFO|tokenization_utils_base.py:1786] 2021-02-24 11:38:37,946 >> loading file https://huggingface.co/huggingface/distilbert-base-uncased-finetuned-mnli/resolve/main/vocab.txt from cache at C:\Users\Ali/.cache\huggingface\transformers\3aa49bfb368cde995cea246a5c5ca4d75f769e74b3e6d450776805f998c78366.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99
[INFO|tokenization_utils_base.py:1786] 2021-02-24 11:38:37,947 >> loading file https://huggingface.co/huggingface/distilbert-base-uncased-finetuned-mnli/resolve/main/tokenizer.json from cache at None
[INFO|tokenization_utils_base.py:1786] 2021-02-24 11:38:37,950 >> loading file https://huggingface.co/huggingface/distilbert-base-uncased-finetuned-mnli/resolve/main/added_tokens.json from cache at C:\Users\Ali/.cache\huggingface\transformers\603dca04f5c89cbdcdb8021ec21c4376c7334fa6393347c80a54c942a93e50cb.5cc6e825eb228a7a5cfd27cb4d7151e97a79fb962b31aaf1813aa102e746584b
[INFO|tokenization_utils_base.py:1786] 2021-02-24 11:38:37,951 >> loading file https://huggingface.co/huggingface/distilbert-base-uncased-finetuned-mnli/resolve/main/special_tokens_map.json from cache at C:\Users\Ali/.cache\huggingface\transformers\dea17c39d149e23cb97e2a2829c6170489551d2454352fd18488f17bf90c54db.dd8bd9bfd3664b530ea4e645105f557769387b3da9f79bdb55ed556bdd80611d
[INFO|tokenization_utils_base.py:1786] 2021-02-24 11:38:37,952 >> loading file https://huggingface.co/huggingface/distilbert-base-uncased-finetuned-mnli/resolve/main/tokenizer_config.json from cache at C:\Users\Ali/.cache\huggingface\transformers\ce6fb0f339483f5ca331e9631b13bc5e9c842e64e9a40aa60defb3898b99dbed.11d9edb6b1301b5af13d33c1585ff45ff84dd55cc6915c2872f856d1ee2dc409
[INFO|modeling_utils.py:1027] 2021-02-24 11:38:38,148 >> loading weights file https://huggingface.co/huggingface/distilbert-base-uncased-finetuned-mnli/resolve/main/pytorch_model.bin from cache at C:\Users\Ali/.cache\huggingface\transformers\16516ebd442e5f41cd8caf2de88c478fe8a3a0948e20eaf1fdae0bf2d4998be6.73881288e7255a28dacc8ad53661dde9248c11f6e2d10f3b6db193dddee2a2bc
[INFO|modeling_utils.py:1143] 2021-02-24 11:38:39,218 >> All model checkpoint weights were used when initializing DistilBertForSequenceClassification.
[INFO|modeling_utils.py:1152] 2021-02-24 11:38:39,221 >> All the weights of DistilBertForSequenceClassification were initialized from the model checkpoint at huggingface/distilbert-base-uncased-finetuned-mnli.
If your task is similar to the task the model of the checkpoint was trained on, you can already use DistilBertForSequenceClassification for predictions without further training.
02/24/2021 11:38:39 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at C:\Users\Ali.cache\huggingface\datasets\glue\mnli\1.0.0\7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4\cache-0a88ac8e6b3bd378.arrow
02/24/2021 11:38:39 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at C:\Users\Ali.cache\huggingface\datasets\glue\mnli\1.0.0\7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4\cache-e1993e6695981db0.arrow
02/24/2021 11:38:39 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at C:\Users\Ali.cache\huggingface\datasets\glue\mnli\1.0.0\7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4\cache-133d62ae090971a5.arrow
02/24/2021 11:38:39 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at C:\Users\Ali.cache\huggingface\datasets\glue\mnli\1.0.0\7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4\cache-497afbfcce3a8a9d.arrow
02/24/2021 11:38:39 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at C:\Users\Ali.cache\huggingface\datasets\glue\mnli\1.0.0\7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4\cache-7146b31017748988.arrow
02/24/2021 11:38:39 - INFO - main - Sample 335243 of the training set: {‘attention_mask’: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], ‘hypothesis’: “Parents are busy and it’s sometimes hard to get them out.”, ‘idx’: 335243, ‘input_ids’: [101, 2017, 2113, 2043, 2037, 3008, 2272, 1998, 2009, 1005, 1055, 2524, 2000, 2131, 2068, 2041, 1998, 1037, 2843, 1997, 3008, 2031, 3182, 2000, 2175, 1998, 1998, 2477, 2066, 2008, 1998, 2009, 1005, 1055, 2397, 2012, 2305, 2061, 102, 3008, 2024, 5697, 1998, 2009, 1005, 1055, 2823, 2524, 2000, 2131, 2068, 2041, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], ‘label’: 0, ‘premise’: “you know when their parents come and it’s hard to get them out and a lot of parents have places to go and and things like that and it’s late at night so”}.
02/24/2021 11:38:39 - INFO - main - Sample 58369 of the training set: {‘attention_mask’: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], ‘hypothesis’: 'Where and what is art? ', ‘idx’: 58369, ‘input_ids’: [101, 2073, 2003, 2396, 1029, 102, 2073, 1998, 2054, 2003, 2396, 1029, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], ‘label’: 1, ‘premise’: ‘Where is art?’}.
02/24/2021 11:38:39 - INFO - main - Sample 13112 of the training set: {‘attention_mask’: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], ‘hypothesis’: ‘The list says alcohol and injury are negatives facing staff.’, ‘idx’: 13112, ‘input_ids’: [101, 6544, 1998, 4544, 1010, 2004, 2092, 2004, 4766, 19388, 1010, 2024, 2006, 1996, 2862, 1012, 102, 1996, 2862, 2758, 6544, 1998, 4544, 2024, 4997, 2015, 5307, 3095, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], ‘label’: 1, ‘premise’: ‘Alcohol and injury, as well as brief interventions, are on the list.’}.
[INFO|trainer.py:432] 2021-02-24 11:38:41,361 >> The following columns in the training set don’t have a corresponding argument in DistilBertForSequenceClassification.forward and have been ignored: premise, hypothesis, idx.
[INFO|trainer.py:432] 2021-02-24 11:38:41,362 >> The following columns in the evaluation set don’t have a corresponding argument in DistilBertForSequenceClassification.forward and have been ignored: premise, hypothesis, idx.
02/24/2021 11:38:41 - INFO - main - *** Evaluate ***
[INFO|trainer.py:432] 2021-02-24 11:38:41,366 >> The following columns in the evaluation set don’t have a corresponding argument in DistilBertForSequenceClassification.forward and have been ignored: premise, hypothesis, idx.
[INFO|trainer.py:1600] 2021-02-24 11:38:41,371 >> ***** Running Evaluation *****
[INFO|trainer.py:1601] 2021-02-24 11:38:41,371 >> Num examples = 9815
[INFO|trainer.py:1602] 2021-02-24 11:38:41,372 >> Batch size = 8
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1227/1227 [00:10<00:00, 122.19it/s]
02/24/2021 11:38:52 - INFO - main - ***** Eval results mnli *****
02/24/2021 11:38:52 - INFO - main - eval_accuracy = 0.07865511971472236
02/24/2021 11:38:52 - INFO - main - eval_loss = 4.536623954772949
02/24/2021 11:38:52 - INFO - main - eval_runtime = 10.733
02/24/2021 11:38:52 - INFO - main - eval_samples_per_second = 914.471
[INFO|trainer.py:432] 2021-02-24 11:38:52,120 >> The following columns in the evaluation set don’t have a corresponding argument in DistilBertForSequenceClassification.forward and have been ignored: premise, hypothesis, idx.
[INFO|trainer.py:1600] 2021-02-24 11:38:52,124 >> ***** Running Evaluation *****
[INFO|trainer.py:1601] 2021-02-24 11:38:52,124 >> Num examples = 9832
[INFO|trainer.py:1602] 2021-02-24 11:38:52,125 >> Batch size = 8
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1229/1229 [00:10<00:00, 121.59it/s]
02/24/2021 11:39:02 - INFO - main - ***** Eval results mnli-mm *****
02/24/2021 11:39:02 - INFO - main - eval_accuracy = 0.08482506102522376
02/24/2021 11:39:02 - INFO - main - eval_loss = 4.487601280212402
02/24/2021 11:39:02 - INFO - main - eval_runtime = 10.127
02/24/2021 11:39:02 - INFO - main - eval_samples_per_second = 970.87
Expected behavior
All the weights appear to be loaded into the right places, yet the accuracy is below 10% when it should be above 80%.
[INFO|modeling_utils.py:1143] 2021-02-24 11:38:39,218 >> All model checkpoint weights were used when initializing DistilBertForSequenceClassification.
[INFO|modeling_utils.py:1152] 2021-02-24 11:38:39,221 >> All the weights of DistilBertForSequenceClassification were initialized from the model checkpoint at huggingface/distilbert-base-uncased-finetuned-mnli.
If your task is similar to the task the model of the checkpoint was trained on, you can already use DistilBertForSequenceClassification for predictions without further training.
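For reference, a minimal single-pair sanity check of the kind I would expect such a checkpoint to pass (a sketch; the example sentences are made up):

```python
# Sketch: an MNLI-finetuned checkpoint should put most probability mass on
# whichever class id its config maps to "entailment" for an obvious pair.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "huggingface/distilbert-base-uncased-finetuned-mnli"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()

enc = tok("A man is playing a guitar.", "A person is playing an instrument.",
          return_tensors="pt")
with torch.no_grad():
    probs = model(**enc).logits.softmax(-1)[0]
for i, p in enumerate(probs.tolist()):
    print(model.config.id2label[i], round(p, 3))
```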
Comments
The model was fixed a year ago, in this commit.
Hello! This may be because the labels were switched around for the MNLI task. See this thread for more context: https://github.com/huggingface/transformers/pull/10203
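For anyone hitting the same symptom with other MNLI checkpoints, a rough workaround sketch (assuming the weights themselves are fine and only the label ordering disagrees with the datasets GLUE ordering) is to translate the model's predicted class ids into the dataset's label space before scoring:

```python
# Sketch: score a small slice of validation_matched after remapping the
# model's predicted ids into the dataset's label ids via label names.
import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "huggingface/distilbert-base-uncased-finetuned-mnli"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()

mnli = load_dataset("glue", "mnli", split="validation_matched")
dataset_label2id = {n: i for i, n in enumerate(mnli.features["label"].names)}

correct = 0
n = 100  # small slice for a quick check
for ex in mnli.select(range(n)):
    enc = tok(ex["premise"], ex["hypothesis"], truncation=True, return_tensors="pt")
    with torch.no_grad():
        pred = model(**enc).logits.argmax(-1).item()
    pred_name = model.config.id2label[pred].lower()  # e.g. "entailment"
    if dataset_label2id.get(pred_name, -1) == ex["label"]:
        correct += 1
print(f"accuracy on {n} examples: {correct / n:.3f}")
```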