
[docs] Incorrect way of input encoding for "multiple choice" models in documentation?


In the documentation for “xxForMultipleChoice” models like BERT, ALBERT, and RoBERTa, the example goes like this:

>>> from transformers import BertTokenizer, BertForMultipleChoice
>>> import torch

>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> model = BertForMultipleChoice.from_pretrained('bert-base-uncased')

>>> prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced."
>>> choice0 = "It is eaten with a fork and a knife."
>>> choice1 = "It is eaten while held in the hand."
>>> labels = torch.tensor(0).unsqueeze(0)  # choice0 is correct (according to Wikipedia ;)), batch size 1

>>> encoding = tokenizer([[prompt, prompt], [choice0, choice1]], return_tensors='pt', padding=True)
>>> outputs = model(**{k: v.unsqueeze(0) for k,v in encoding.items()}, labels=labels)  # batch size is 1
...

In the current version (4.5.1), the encoding actually consists of 2 sequences: [prompt + prompt] and [choice0 + choice1], which to my knowledge is incorrect, as each encoded sequence should pair one prompt with one choice. I think the encoding is supposed to be:

tokenizer([[prompt, choice0], [prompt, choice1]], return_tensors='pt', padding=True)

So, is there anything wrong?
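
For reference, a minimal sketch of the pairing described above (it only assumes the same transformers/torch setup as the documentation example); decoding each encoded row shows whether a prompt actually got paired with a choice:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced."
choice0 = "It is eaten with a fork and a knife."
choice1 = "It is eaten while held in the hand."

# Pair the prompt with each choice, as suggested above.
encoding = tokenizer([[prompt, choice0], [prompt, choice1]], return_tensors='pt', padding=True)

# Each row should decode to "[CLS] prompt [SEP] choice [SEP]", i.e. one
# prompt paired with one choice per encoded sequence.
for row in encoding['input_ids']:
    print(tokenizer.decode(row.tolist()))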

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

2 reactions
sgugger commented, Jul 21, 2021

This has been fixed, but you need to switch to the master documentation to see the change.

0 reactions
mbforbes commented, Jul 21, 2021

I’m running into this issue as well.

I’m not super familiar with multiple choice models, but I think that, given #6074 and the run_swag.py example, these should be passed as two lists instead of one.

In other words, in the example, instead of

encoding = tokenizer([[prompt, prompt], [choice0, choice1]], return_tensors='pt', padding=True)

it should be

encoding = tokenizer([prompt, prompt], [choice0, choice1], return_tensors='pt', padding=True)

It would be awesome to make this two-character deletion, as it just tripped me up when starting to work on a multiple choice model!
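
For what it’s worth, a hedged end-to-end sketch of the two-list call suggested above, under the same assumptions as the documentation snippet (transformers, torch, and the bert-base-uncased weights):

import torch
from transformers import BertTokenizer, BertForMultipleChoice

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMultipleChoice.from_pretrained('bert-base-uncased')

prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced."
choice0 = "It is eaten with a fork and a knife."
choice1 = "It is eaten while held in the hand."
labels = torch.tensor(0).unsqueeze(0)  # choice0 is correct, batch size 1

# The first list holds the first segment of each pair and the second list the
# second segment, so row i pairs the prompt with choice i.
encoding = tokenizer([prompt, prompt], [choice0, choice1], return_tensors='pt', padding=True)

# Reshape each tensor from (num_choices, seq_len) to (batch, num_choices, seq_len).
outputs = model(**{k: v.unsqueeze(0) for k, v in encoding.items()}, labels=labels)
print(outputs.logits.shape)  # expected: torch.Size([1, 2])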
