Pipeline returns inconsistent results when using a non-default model
System Info
Transformers version 4.19.2 Python 3.7.13 Ubuntu 16.04.6 LTS
Who can help?
I’ve noticed that pipeline returns inconsistent results after re-instantiating it when supplying a non-standard model. See the code below.
- What is being returned, and why does it change between instantiations?
- What exactly does pipeline do when you give it a non-default model, or a model not trained for the specific task?
- Since it doesn’t necessarily make sense to use bert-base-uncased for a sentiment-analysis task, should pipeline allow this? I don’t get a warning or error. Is there a recommended way to tell pipeline to fail if the supplied model doesn’t make sense?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
>>> from transformers import pipeline
>>> pipe = pipeline("sentiment-analysis", model="bert-base-uncased")
>>> pipe("This restaurant is awesome")
[{'label': 'LABEL_0', 'score': 0.5899267196655273}]
>>> pipe = pipeline("sentiment-analysis", model="bert-base-uncased")
>>> pipe("This restaurant is awesome")
[{'label': 'LABEL_0', 'score': 0.5623320937156677}]
>>> pipe = pipeline("sentiment-analysis", model="bert-base-uncased")
>>> pipe("This restaurant is awesome")
[{'label': 'LABEL_1', 'score': 0.5405012369155884}]
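The variation goes away if the random initialization is seeded before each instantiation, which points at a randomly initialized component as the cause. A minimal sketch, assuming transformers is installed and the model can be downloaded:

```python
from transformers import pipeline, set_seed

TEXT = "This restaurant is awesome"

# set_seed seeds Python, NumPy, and PyTorch RNGs, so any randomly
# initialized weights come out identical across instantiations.
set_seed(42)
first = pipeline("sentiment-analysis", model="bert-base-uncased")(TEXT)

set_seed(42)
second = pipeline("sentiment-analysis", model="bert-base-uncased")(TEXT)

# The predictions are still meaningless, but now they are reproducible.
print(first == second)
```

Note that seeding only makes the symptom deterministic; it does not make bert-base-uncased a usable sentiment classifier.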
Expected behavior
I would expect pipeline to either fail or emit a warning when given a model not trained for the task.
Issue Analytics
- State:
- Created a year ago
- Comments:11 (7 by maintainers)
Top GitHub Comments
By default the classification head is initialized randomly, and the correct pretrained weights are then placed onto your model. Since those weights are missing from this checkpoint, we simply don’t place them. That’s why the outputs change all the time: the head is different every time.
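This is easy to verify: load the checkpoint into the classification architecture twice and compare the weights. A sketch, assuming the bert-base-uncased checkpoint is available:

```python
import torch
from transformers import AutoModelForSequenceClassification

NAME = "bert-base-uncased"

# Two independent loads: the backbone weights come from the checkpoint,
# while the classification head is freshly (randomly) initialized each time.
m1 = AutoModelForSequenceClassification.from_pretrained(NAME)
m2 = AutoModelForSequenceClassification.from_pretrained(NAME)

# Backbone weights are restored from the checkpoint, so they match.
backbone_equal = torch.equal(
    m1.bert.embeddings.word_embeddings.weight,
    m2.bert.embeddings.word_embeddings.weight,
)
# The classification head has no weights in the checkpoint, so the two
# random initializations differ.
head_equal = torch.equal(m1.classifier.weight, m2.classifier.weight)
print(backbone_equal, head_equal)
```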
@sjgiorgi
I do agree that it’s easy to miss warnings, especially when running setups automatically and serving them; in those cases the warnings might not be readily visible to you.
The real culprit here is that the model architecture you are trying to load is perfectly capable of running the pipeline, but the model weights themselves are missing the layers the architecture is looking for (here, the classification head).
Catching the warning would be the best way to be 100% sure it works that way.
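Rather than parsing the warning text, from_pretrained accepts output_loading_info=True, which additionally returns the lists of missing and mismatched keys; a sketch of failing fast on an incomplete checkpoint:

```python
from transformers import AutoModelForSequenceClassification

# output_loading_info=True returns a dict alongside the model, with keys
# such as "missing_keys", "unexpected_keys", and "mismatched_keys".
model, info = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", output_loading_info=True
)

missing = info["missing_keys"]
if missing:
    # For bert-base-uncased this lists the classification-head weights
    # that the checkpoint does not contain.
    print(f"Checkpoint is missing weights for this architecture: {missing}")
```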
Pinging a core maintainer to see if we have other solutions. My personal idea would be to add a flag that raises a hard error on mismatched weights instead of a warning, and to use that flag in pipelines, because we really don’t want to load an incomplete model from pretrained weights there. It’s a different story in Model.from_pretrained, where it’s actually a desired feature if you intend to finetune.
@sgugger maybe ?