label list in MNLI dataset

Environment info

transformers version:
Platform: centos7.2
Python version: Python3.6.8
PyTorch version (GPU?): None
Tensorflow version (GPU?): None
Using GPU in script?: None
Using distributed or parallel set-up in script?: None

Who can help

Models:

albert, bert, xlm: @LysandreJik

Information

Model I am using bert-base-uncased-mnli

The problem arises when using:

the official example scripts: (give details below)
my own modified scripts: (give details below)

The task I am working on is:

an official GLUE/SQUaD task: MNLI
my own task or dataset:

To reproduce

When processing the label list for the MNLI task, I noticed that label_list is defined differently in Hugging Face transformers and Hugging Face datasets.

label list in datasets

If I load my data via datasets:

from datasets import load_dataset

# Load the MNLI configuration of GLUE and print the label names in id order
raw_datasets = load_dataset("glue", "mnli")
print(raw_datasets["validation_matched"].features["label"].names)

It returns:

['entailment', 'neutral', 'contradiction']

The same label order is also given in the documentation: https://huggingface.co/datasets/glue.

mnli
premise: a string feature.
hypothesis: a string feature.
label: a classification label, with possible values including entailment (0), neutral (1), contradiction (2).
idx: an int32 feature.

label list in transformers

But in huggingface transformers:

import transformers

# Instantiate the GLUE processor for MNLI and print its label list
processor = transformers.glue_processors['mnli']()
label_list = processor.get_labels()
print(label_list)

It returns:

['contradiction', 'entailment', 'neutral']

label configs

I checked the config used in datasets which is downloaded from https://raw.githubusercontent.com/huggingface/datasets/1.8.0/datasets/glue/glue.py.

The definition of label_classes is:

label_classes=["entailment", "neutral", "contradiction"],

In transformers master, the label list is defined in this function: https://github.com/huggingface/transformers/blob/15d19ecfda5de8c4b50e2cd3129a16de281dbd6d/src/transformers/data/processors/glue.py#L247

It is confusing that the same MNLI task uses a different label order in datasets and transformers. I would expect the order to be the same in both libraries.
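To make the consequence concrete, here is a minimal sketch in plain Python that compares the two label lists quoted above, id by id. It calls neither library; the variable names are illustrative assumptions, and only the two lists come from the outputs shown in this issue.

```python
# Label orders as printed in this issue. The variable names are hypothetical.
datasets_labels = ["entailment", "neutral", "contradiction"]
transformers_labels = ["contradiction", "entailment", "neutral"]

# For every integer id, show which class name each list assigns to it.
for label_id, (ds_name, tf_name) in enumerate(zip(datasets_labels, transformers_labels)):
    status = "match" if ds_name == tf_name else "MISMATCH"
    print(label_id, ds_name, tf_name, status)

# Collect the ids whose meaning differs between the two lists.
mismatched_ids = [
    i for i, (a, b) in enumerate(zip(datasets_labels, transformers_labels)) if a != b
]
```

With these two orders, every id refers to a different class, so integer labels from one source cannot be compared directly with integer predictions that follow the other order.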

Top GitHub Comments

lewtun commented, Jul 21, 2021

Indeed the order between a model’s labels and those in a dataset can differ, which is why we’ve added a Dataset.align_labels_with_mapping function in this PR: https://github.com/huggingface/datasets/pull/2457
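The idea behind aligning labels can be sketched in plain Python. This is a rough illustration under stated assumptions, not the implementation from the PR: it does not call Dataset.align_labels_with_mapping, and the names model_labels, id_remap, and align_label are hypothetical. It assumes the model follows the order returned by the transformers processor in this issue.

```python
# Label orders quoted in the issue above.
datasets_labels = ["entailment", "neutral", "contradiction"]
# Assumption: the model uses the order returned by the transformers processor.
model_labels = ["contradiction", "entailment", "neutral"]

# Map each class name to the id the model expects.
label2id = {name: i for i, name in enumerate(model_labels)}

# id_remap[dataset_id] gives the model id for the same class name.
id_remap = [label2id[name] for name in datasets_labels]


def align_label(dataset_label_id):
    """Translate a dataset label id into the model's label id (hypothetical helper)."""
    return id_remap[dataset_label_id]


# Example: a few dataset label ids translated into the model's id space.
dataset_label_ids = [0, 1, 2, 0]
aligned = [align_label(i) for i in dataset_label_ids]
print(id_remap, aligned)
```

Matching on class names rather than positions means the remapping stays correct as long as both sides use the same set of names, whatever their order.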

github-actions[bot] commented, Aug 20, 2021

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
