Mismatch between tokenizer and model in pipeline
Environment info
- `transformers` version: 4.6.1
- Platform: Linux-4.18.0-25-generic-x86_64-with-glibc2.10
- Python version: 3.8.5
- PyTorch version (GPU?): 1.8.1+cu102 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: Nope
- Using distributed or parallel set-up in script?: Nope
Who can help
Information
Model I am using (Bert, XLNet …): any model used with `pipeline`.
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce

```python
In [14]: p = pipeline("sentiment-analysis", tokenizer='cardiffnlp/twitter-roberta-base-sentiment')  # only the tokenizer is provided

In [15]: p.tokenizer  # RoBERTa, as provided
Out[15]: PreTrainedTokenizerFast(name_or_path='cardiffnlp/twitter-roberta-base-sentiment', vocab_size=50265, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False)})

In [21]: p.model.config.model_type  # falls back to the hard-coded default model
Out[21]: 'distilbert'

In [22]: p("What a lovely day")  # does not work: the default model is paired with the mismatched tokenizer
Out[22]: [{'label': 'NEGATIVE', 'score': 0.8819105625152588}]
```
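For comparison, providing the checkpoint as the `model` argument (so the tokenizer is inferred from it) behaves as expected; a minimal sketch using the same checkpoint:

```python
from transformers import pipeline

# Passing the checkpoint as `model` makes the pipeline load the matching tokenizer too.
p = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment")
print(p("What a lovely day"))  # e.g. [{'label': 'LABEL_2', 'score': ...}] (LABEL_2 = positive for this checkpoint)
```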
Expected behavior
The tokenizer and model should be compatible regardless of how the arguments to `pipeline` are given.
I think the if statements in the `pipeline` function should be something like the below to handle all the cases:

```python
if model is not None and tokenizer is not None:
    ...  # case 1: both provided; maybe assert that the two are compatible?
elif model is None and tokenizer is None:
    ...  # case 2: use the defaults for the task
elif model is not None and tokenizer is None:
    ...  # case 3: tokenizer should follow model
elif model is None and tokenizer is not None:
    ...  # case 4: model should follow tokenizer
```
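A fuller sketch of that dispatch (a hypothetical helper; the function name, the `AutoModel` class, and the default checkpoint are illustrative, not the actual `transformers` source):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def resolve_model_and_tokenizer(model=None, tokenizer=None,
                                default="distilbert-base-uncased-finetuned-sst-2-english"):
    if model is None and tokenizer is None:
        model = tokenizer = default  # case 2: fall back to the task default
    elif model is not None and tokenizer is None:
        # case 3: tokenizer follows model (works for both str and PreTrainedModel)
        tokenizer = model if isinstance(model, str) else model.config._name_or_path
    elif model is None:
        # case 4: model follows tokenizer
        model = tokenizer if isinstance(tokenizer, str) else tokenizer.name_or_path
    # case 1 (both provided): could assert compatibility here
    if isinstance(model, str):
        model = AutoModelForSequenceClassification.from_pretrained(model)
    if isinstance(tokenizer, str):
        tokenizer = AutoTokenizer.from_pretrained(tokenizer)
    return model, tokenizer
```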
In addition, although the current code complains that it cannot infer the tokenizer when the model is given as a `PreTrainedModel` instance (one scenario under case 3), I think it is possible through `AutoTokenizer.from_pretrained(model.config._name_or_path)`, as `_name_or_path` is `'bert-base-cased'`, for example.
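For example (a minimal sketch of that suggestion):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")
# The config remembers the checkpoint the model was loaded from:
# model.config._name_or_path == "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model.config._name_or_path)
```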
Let me know what you think! I would be happy to submit a PR if consensus is reached 😃
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I think we could indeed raise an error/better warning if only a tokenizer is provided. When the model is provided, the tokenizer is selected automatically from that ID, which I agree is a bit weird as it doesn’t work the other way.
I think erroring out when the `tokenizer` is explicitly specified but not the model would be nice, to prevent unseen errors from happening. Is there a use-case I'm not seeing @Narsil?

Your point makes sense to me – too much magic can complicate the issue.
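A minimal sketch of that guard, assuming it sits near the top of the `pipeline()` factory (the wording and exception type are illustrative, not the actual check):

```python
# Hypothetical guard inside pipeline(); raised before any model is loaded.
if model is None and tokenizer is not None:
    raise RuntimeError(
        "Impossible to instantiate a pipeline with a tokenizer but no model: "
        "the provided tokenizer may not be compatible with the default model. "
        "Please also provide a model name or a PreTrainedModel instance."
    )
```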
I opened a PR #12548 that covers the first proposal. I tried to be as descriptive as possible; please take a look 😃