Mismatch between tokenizer and model in pipeline
Environment info
- `transformers` version: 4.6.1
- Platform: Linux-4.18.0-25-generic-x86_64-with-glibc2.10
- Python version: 3.8.5
- PyTorch version (GPU?): 1.8.1+cu102 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: Nope
- Using distributed or parallel set-up in script?: Nope
Who can help
Information
Model I am using (Bert, XLNet …): any model used with `pipeline`.
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce

```python
In [14]: p = pipeline("sentiment-analysis", tokenizer='cardiffnlp/twitter-roberta-base-sentiment')  # only the tokenizer is provided

In [15]: p.tokenizer  # RoBERTa, as provided
Out[15]: PreTrainedTokenizerFast(name_or_path='cardiffnlp/twitter-roberta-base-sentiment', vocab_size=50265, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False)})

In [21]: p.model.config.model_type  # falls back to the hard-coded default model
Out[21]: 'distilbert'

In [22]: p("What a lovely day")  # does not work: the default model is paired with the mismatched tokenizer
Out[22]: [{'label': 'NEGATIVE', 'score': 0.8819105625152588}]
```
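For comparison, providing the checkpoint as the `model` argument (so the tokenizer is inferred from it) behaves as expected; a minimal sketch using the same checkpoint:

```python
from transformers import pipeline

# Passing the checkpoint as `model` makes the pipeline load the matching tokenizer too.
p = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment")
print(p("What a lovely day"))  # e.g. [{'label': 'LABEL_2', 'score': ...}] (LABEL_2 = positive for this checkpoint)
```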
Expected behavior
The tokenizer and model should be compatible regardless of how the arguments to `pipeline` are given.
I think the if statements in the `pipeline` function should be something like the below to handle all the cases:

```python
if model is not None and tokenizer is not None:
    ...  # case 1: both provided; maybe assert that the two are compatible?
elif model is None and tokenizer is None:
    ...  # case 2: use the defaults for the task
elif model is not None and tokenizer is None:
    ...  # case 3: tokenizer should follow model
elif model is None and tokenizer is not None:
    ...  # case 4: model should follow tokenizer
```
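A fuller sketch of that dispatch (a hypothetical helper; the function name, the `AutoModel` class, and the default checkpoint are illustrative, not the actual `transformers` source):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def resolve_model_and_tokenizer(model=None, tokenizer=None,
                                default="distilbert-base-uncased-finetuned-sst-2-english"):
    if model is None and tokenizer is None:
        model = tokenizer = default  # case 2: fall back to the task default
    elif model is not None and tokenizer is None:
        # case 3: tokenizer follows model (works for both str and PreTrainedModel)
        tokenizer = model if isinstance(model, str) else model.config._name_or_path
    elif model is None:
        # case 4: model follows tokenizer
        model = tokenizer if isinstance(tokenizer, str) else tokenizer.name_or_path
    # case 1 (both provided): could assert compatibility here
    if isinstance(model, str):
        model = AutoModelForSequenceClassification.from_pretrained(model)
    if isinstance(tokenizer, str):
        tokenizer = AutoTokenizer.from_pretrained(tokenizer)
    return model, tokenizer
```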
In addition, although the current code complains that it cannot infer the tokenizer when the model is given as a `PreTrainedModel` instance (one scenario under case 3), I think it is possible through `AutoTokenizer.from_pretrained(model.config._name_or_path)`, as `_name_or_path` is `'bert-base-cased'`, for example.
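For example (a minimal sketch of that suggestion):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")
# The config remembers the checkpoint the model was loaded from:
# model.config._name_or_path == "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model.config._name_or_path)
```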
Let me know what you think! I would be happy to submit a PR if consensus is reached 😃
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I think we could indeed raise an error/better warning if only a tokenizer is provided. When the model is provided, the tokenizer is selected automatically from that ID, which I agree is a bit weird as it doesn’t work the other way.
I think erroring out when the `tokenizer` is explicitly specified but not the model would be nice, to prevent unseen errors from happening. Is there a use-case I'm not seeing @Narsil?

Your point makes sense to me – too much magic can complicate the issue.
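A minimal sketch of that guard, assuming it sits near the top of the `pipeline()` factory (the wording and exception type are illustrative, not the actual check):

```python
# Hypothetical guard inside pipeline(); raised before any model is loaded.
if model is None and tokenizer is not None:
    raise RuntimeError(
        "Impossible to instantiate a pipeline with a tokenizer but no model: "
        "the provided tokenizer may not be compatible with the default model. "
        "Please also provide a model name or a PreTrainedModel instance."
    )
```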
I opened a PR #12548 that covers the first proposal. I tried to be as descriptive as possible; please take a look 😃