[docs] Incorrect input encoding for "multiple choice" models in documentation?
In the documentation for "xxForMultipleChoice" models such as BERT, ALBERT, and RoBERTa, the example goes like this:
>>> from transformers import BertTokenizer, BertForMultipleChoice
>>> import torch
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> model = BertForMultipleChoice.from_pretrained('bert-base-uncased')
>>> prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced."
>>> choice0 = "It is eaten with a fork and a knife."
>>> choice1 = "It is eaten while held in the hand."
>>> labels = torch.tensor(0).unsqueeze(0) # choice0 is correct (according to Wikipedia ;)), batch size 1
>>> encoding = tokenizer([[prompt, prompt], [choice0, choice1]], return_tensors='pt', padding=True)
>>> outputs = model(**{k: v.unsqueeze(0) for k,v in encoding.items()}, labels=labels) # batch size is 1
...
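For background, the `xxForMultipleChoice` models expect `input_ids` of shape `(batch_size, num_choices, sequence_length)` — hence the `unsqueeze(0)` above — and score each choice's sequence separately. As a minimal pure-Python sketch (not the actual transformers code; `flatten_choices` is a hypothetical helper), the head conceptually flattens the choice dimension before scoring:

```python
# Hypothetical sketch: a multiple-choice head flattens
# (batch, num_choices, seq_len) into (batch * num_choices, seq_len),
# scores each flattened sequence, then regroups the scores per example.
def flatten_choices(batch):
    # batch: list of examples, each a list of num_choices token-id lists
    return [seq for example in batch for seq in example]

batch = [[[101, 7592, 102], [101, 2088, 102]]]  # batch=1, num_choices=2
flat = flatten_choices(batch)
print(len(flat))  # one sequence to score per choice
```

This is only meant to show why the batch dimension added by `unsqueeze(0)` matters: without it, the model would misread the choice dimension as the batch dimension.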
In the current version (4.5.1), this encoding actually produces two sequence pairs: (prompt, prompt) and (choice0, choice1). To my knowledge this is incorrect, as each encoded pair should combine the prompt with one choice. I think the encoding is supposed to be:
tokenizer([[prompt, choice0], [prompt, choice1]], return_tensors='pt', padding=True)
So, is there something wrong with the documentation?
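To see why the nesting matters, here is a toy stand-in for the tokenizer's pair handling — `fake_pair_encode` is a hypothetical illustration, not the real transformers tokenizer — showing which (sequence A, sequence B) pairs each nesting produces:

```python
# Toy illustration: the tokenizer treats each inner list as one
# (sequence A, sequence B) pair, as in BERT's "[CLS] A [SEP] B [SEP]".
# fake_pair_encode is hypothetical and not part of transformers.
def fake_pair_encode(batch):
    return [f"[CLS] {a} [SEP] {b} [SEP]" for a, b in batch]

prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced."
choice0 = "It is eaten with a fork and a knife."
choice1 = "It is eaten while held in the hand."

documented = fake_pair_encode([[prompt, prompt], [choice0, choice1]])
proposed = fake_pair_encode([[prompt, choice0], [prompt, choice1]])
# documented[0] pairs the prompt with itself;
# proposed[0] and proposed[1] each pair the prompt with one choice.
```

Under this reading, the documented nesting never puts a prompt and a choice in the same sequence, which is what the issue is pointing out.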
Issue Analytics
- State:
- Created: 2 years ago
- Comments: 6 (2 by maintainers)
Top GitHub Comments
This has been fixed, but you need to switch to the master documentation to see the change.
I’m running into this issue as well.
I’m not super familiar with multiple-choice models, but I think that, given #6074 and the run_swag.py example, these should be passed as prompt–choice pairs instead.
In other words, in the example, instead of
tokenizer([[prompt, prompt], [choice0, choice1]], return_tensors='pt', padding=True)
it should be
tokenizer([[prompt, choice0], [prompt, choice1]], return_tensors='pt', padding=True)
It would be awesome to make this small change, as it just tripped me up when I started working on a multiple choice model!