
[docs] Incorrect way of input encoding for "multiple choice" models in documentation?


In the documentation for “xxForMultipleChoice” models like BERT, ALBERT, and RoBERTa, the example goes like this:

>>> from transformers import BertTokenizer, BertForMultipleChoice
>>> import torch

>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> model = BertForMultipleChoice.from_pretrained('bert-base-uncased')

>>> prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced."
>>> choice0 = "It is eaten with a fork and a knife."
>>> choice1 = "It is eaten while held in the hand."
>>> labels = torch.tensor(0).unsqueeze(0)  # choice0 is correct (according to Wikipedia ;)), batch size 1

>>> encoding = tokenizer([[prompt, prompt], [choice0, choice1]], return_tensors='pt', padding=True)
>>> outputs = model(**{k: v.unsqueeze(0) for k,v in encoding.items()}, labels=labels)  # batch size is 1
...

In the current version (4.5.1), the encoding actually consists of 2 sequences: [prompt + prompt] and [choice0 + choice1], which to my knowledge is incorrect, as each encoded sequence should pair one prompt with one choice. I think the encoding is supposed to be:

tokenizer([[prompt, choice0], [prompt, choice1]], return_tensors='pt', padding=True)

So, is there anything wrong?
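
For reference, a minimal sketch of the pairing described above (it only assumes the same transformers/torch setup as the documentation example); decoding each encoded row shows whether a prompt actually got paired with a choice:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced."
choice0 = "It is eaten with a fork and a knife."
choice1 = "It is eaten while held in the hand."

# Pair the prompt with each choice, as suggested above.
encoding = tokenizer([[prompt, choice0], [prompt, choice1]], return_tensors='pt', padding=True)

# Each row should decode to "[CLS] prompt [SEP] choice [SEP]", i.e. one
# prompt paired with one choice per encoded sequence.
for row in encoding['input_ids']:
    print(tokenizer.decode(row.tolist()))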

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

2 reactions
sgugger commented, Jul 21, 2021

This has been fixed, but you need to switch to the master documentation to see the change.

0 reactions
mbforbes commented, Jul 21, 2021

I’m running into this issue as well.

I’m not super familiar with multiple choice models, but I think that, given #6074 and the run_swag.py example, these should be passed as two lists instead of one.

In other words, in the example, instead of

encoding = tokenizer([[prompt, prompt], [choice0, choice1]], return_tensors='pt', padding=True)

it should be

encoding = tokenizer([prompt, prompt], [choice0, choice1], return_tensors='pt', padding=True)

It would be awesome to make this two-character deletion, as it just tripped me up when starting to work on a multiple choice model!
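
For what it’s worth, a hedged end-to-end sketch of the two-list call suggested above, under the same assumptions as the documentation snippet (transformers, torch, and the bert-base-uncased weights):

import torch
from transformers import BertTokenizer, BertForMultipleChoice

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMultipleChoice.from_pretrained('bert-base-uncased')

prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced."
choice0 = "It is eaten with a fork and a knife."
choice1 = "It is eaten while held in the hand."
labels = torch.tensor(0).unsqueeze(0)  # choice0 is correct, batch size 1

# The first list holds the first segment of each pair and the second list the
# second segment, so row i pairs the prompt with choice i.
encoding = tokenizer([prompt, prompt], [choice0, choice1], return_tensors='pt', padding=True)

# Reshape each tensor from (num_choices, seq_len) to (batch, num_choices, seq_len).
outputs = model(**{k: v.unsqueeze(0) for k, v in encoding.items()}, labels=labels)
print(outputs.logits.shape)  # expected: torch.Size([1, 2])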
