label list in MNLI dataset
Environment info
- `transformers` version:
- Platform: centos7.2
- Python version: 3.6.8
- PyTorch version (GPU?): None
- Tensorflow version (GPU?): None
- Using GPU in script?: None
- Using distributed or parallel set-up in script?: None
Who can help
Models:
- albert, bert, xlm: @LysandreJik
Information
Model I am using: `bert-base-uncased-mnli`
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: MNLI
- my own task or dataset:
To reproduce
When processing the label list for the MNLI task, I noticed that `label_list` is defined differently in Hugging Face `transformers` and Hugging Face `datasets`.
label list in datasets
If I load my data via datasets:
```python
import datasets
from datasets import load_dataset, load_metric

raw_datasets = load_dataset("glue", "mnli")
print(raw_datasets["validation_matched"].features["label"].names)
```
It returns:
```
['entailment', 'neutral', 'contradiction']
```
The labels are also documented at https://huggingface.co/datasets/glue:
mnli
premise: a string feature.
hypothesis: a string feature.
label: a classification label, with possible values including entailment (0), neutral (1), contradiction (2).
idx: an int32 feature.
label list in transformers
But in Hugging Face `transformers`:
```python
import transformers

processor = transformers.glue_processors["mnli"]()
label_list = processor.get_labels()
print(label_list)
```
It returns:
```
['contradiction', 'entailment', 'neutral']
```
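To make the mismatch concrete, here is a small self-contained check (the two lists are copied from the outputs above): the same integer id refers to a different class depending on which library produced it.

```python
# Label orders copied from the two outputs above.
datasets_labels = ["entailment", "neutral", "contradiction"]    # from datasets
processor_labels = ["contradiction", "entailment", "neutral"]   # from the glue processor

# The same integer id names a different class in each library.
for i in range(3):
    print(f"id {i}: datasets={datasets_labels[i]!r}, transformers={processor_labels[i]!r}")
```

For example, id 0 means "entailment" on the `datasets` side but "contradiction" on the `transformers` side, so labels fed directly from one to the other would be silently wrong for every non-matching class.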
label configs
I checked the config used in `datasets`, which is downloaded from https://raw.githubusercontent.com/huggingface/datasets/1.8.0/datasets/glue/glue.py.
The definition of `label_classes` is:
label_classes=["entailment", "neutral", "contradiction"],
In the `transformers` master branch, it is defined in this function: https://github.com/huggingface/transformers/blob/15d19ecfda5de8c4b50e2cd3129a16de281dbd6d/src/transformers/data/processors/glue.py#L247
It is confusing that the same MNLI task uses a different label order in `datasets` and `transformers`. I would expect the order to be the same in both libraries.
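Until the two orders agree, one possible workaround (a sketch, not an official API) is to remap the dataset's integer ids into the processor's order before training:

```python
# Sketch of a manual remapping from the datasets label ids to the
# glue processor label ids (both orders taken from the outputs above).
DATASETS_ORDER = ["entailment", "neutral", "contradiction"]
PROCESSOR_ORDER = ["contradiction", "entailment", "neutral"]

processor_label2id = {label: i for i, label in enumerate(PROCESSOR_ORDER)}
# Old datasets id -> new processor id.
id_remap = {i: processor_label2id[label] for i, label in enumerate(DATASETS_ORDER)}
print(id_remap)  # {0: 1, 1: 2, 2: 0}
```

This mapping could then be applied per example, e.g. with `raw_datasets.map(lambda ex: {"label": id_remap[ex["label"]]})`.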
Issue Analytics
- Created 2 years ago
- Comments: 5 (3 by maintainers)
Top GitHub Comments
Indeed, the order between a model’s labels and those in a dataset can differ, which is why we’ve added a `Dataset.align_labels_with_mapping` function in this PR: https://github.com/huggingface/datasets/pull/2457

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
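As a hedged sketch of the `align_labels_with_mapping` approach mentioned in the maintainer comment above: the network call is commented out so the snippet stays self-contained, and `label2id` is written out by hand here, though in a real script it would typically come from the model config (e.g. `model.config.label2id`).

```python
# The model's label order, written out by hand for this sketch.
label2id = {"contradiction": 0, "entailment": 1, "neutral": 2}

# With network access, the alignment would be a single call:
# from datasets import load_dataset
# mnli = load_dataset("glue", "mnli")
# aligned = mnli["validation_matched"].align_labels_with_mapping(label2id, "label")

# Its effect is equivalent to remapping the dataset's original ids
# (entailment=0, neutral=1, contradiction=2) into the model's ids:
dataset_id2label = {0: "entailment", 1: "neutral", 2: "contradiction"}
old_to_new = {old: label2id[name] for old, name in dataset_id2label.items()}
print(old_to_new)  # {0: 1, 1: 2, 2: 0}
```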