label list in MNLI dataset

Environment info

  • transformers version:
  • Platform: centos7.2
  • Python version: Python3.6.8
  • PyTorch version (GPU?): None
  • Tensorflow version (GPU?): None
  • Using GPU in script?: None
  • Using distributed or parallel set-up in script?: None

Who can help

Models:

Information

Model I am using bert-base-uncased-mnli

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: MNLI
  • my own task or dataset:

To reproduce

When processing the label list for the MNLI task, I noticed that label_list is defined differently in Hugging Face transformers and Hugging Face datasets.

label list in datasets

If I load my data via datasets:

from datasets import load_dataset

# Load the MNLI configuration of GLUE and print its label names in id order
raw_datasets = load_dataset("glue", "mnli")
print(raw_datasets["validation_matched"].features["label"].names)

It returns:

['entailment', 'neutral', 'contradiction']

The same label order is also documented at https://huggingface.co/datasets/glue:

mnli
premise: a string feature.
hypothesis: a string feature.
label: a classification label, with possible values including entailment (0), neutral (1), contradiction (2).
idx: a int32 feature.

label list in transformers

But in Hugging Face transformers:

import transformers

# Instantiate the GLUE processor for MNLI and print its label list
processor = transformers.glue_processors["mnli"]()
label_list = processor.get_labels()
print(label_list)

It returns:

['contradiction', 'entailment', 'neutral']
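
To make the consequence concrete, here is a minimal sketch that hard-codes the two label lists printed above and checks which integer ids decode to different names. The helper name `decode_mismatches` is made up for illustration and is not part of either library.

```python
# Label orders copied from the two outputs quoted in this issue.
datasets_labels = ["entailment", "neutral", "contradiction"]
processor_labels = ["contradiction", "entailment", "neutral"]


def decode_mismatches(names_a, names_b):
    """Return {id: (name_a, name_b)} for every id whose name differs.

    Hypothetical helper for illustration only.
    """
    return {
        idx: (a, b)
        for idx, (a, b) in enumerate(zip(names_a, names_b))
        if a != b
    }


mismatches = decode_mismatches(datasets_labels, processor_labels)
for idx, (ds_name, proc_name) in mismatches.items():
    print(f"id {idx}: datasets={ds_name!r}, transformers={proc_name!r}")
```

With these two lists every id decodes to a different name, so integer labels from one library cannot be passed to the other without remapping.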

label configs

I checked the config used in datasets which is downloaded from https://raw.githubusercontent.com/huggingface/datasets/1.8.0/datasets/glue/glue.py.

The definition of label_classes is:

label_classes=["entailment", "neutral", "contradiction"],

In transformers master, the label list is defined in this function: https://github.com/huggingface/transformers/blob/15d19ecfda5de8c4b50e2cd3129a16de281dbd6d/src/transformers/data/processors/glue.py#L247

It’s confusing that the same MNLI task uses a different label order in datasets and in transformers. I would expect the order to be the same in both libraries.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
lewtun commented, Jul 21, 2021

Indeed the order between a model’s labels and those in a dataset can differ, which is why we’ve added a Dataset.align_labels_with_mapping function in this PR: https://github.com/huggingface/datasets/pull/2457
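
The idea behind aligning labels by name can be sketched in plain Python. This is not the implementation from that PR; it only illustrates the concept. The helper `build_id_remap` and the `model_label2id` dictionary are assumptions made for this example, with the target ids taken from the processor order quoted in the issue.

```python
# Source order: the label names reported by datasets for GLUE/MNLI.
datasets_labels = ["entailment", "neutral", "contradiction"]

# Target mapping: assumed here to follow the transformers processor order.
model_label2id = {"contradiction": 0, "entailment": 1, "neutral": 2}


def build_id_remap(source_names, target_label2id):
    """Map each source id to the target id that carries the same label name.

    Hypothetical helper; names are compared case-insensitively.
    """
    target = {name.lower(): idx for name, idx in target_label2id.items()}
    return {
        src_id: target[name.lower()]
        for src_id, name in enumerate(source_names)
    }


remap = build_id_remap(datasets_labels, model_label2id)

# Remap a small batch of dataset label ids into the model's id space.
dataset_label_ids = [0, 1, 2, 0]
aligned_label_ids = [remap[label] for label in dataset_label_ids]
print(remap)
print(aligned_label_ids)
```

Matching on names rather than positions means the remap stays correct even if either side reorders its label list, as long as the label names themselves agree.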

0 reactions
github-actions[bot] commented, Aug 20, 2021

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
