Pipeline seems slower in 4.11+
Hello! When I upgraded Transformers, I got a massive slowdown. Might be related to the new DataLoader used in Pipeline.
Happy to help!
Cheers,
Environment info
- `transformers` version: 4.12.0.dev0
- Platform: macOS-10.16-x86_64-i386-64bit
- Python version: 3.8.12
- PyTorch version (GPU?): 1.9.1 (False)
- Tensorflow version (GPU?): 2.6.0 (False)
- Flax version (CPU?/GPU?/TPU?): 0.3.5 (cpu)
- Jax version: 0.2.24
- JaxLib version: 0.1.73
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: <fill in>
Who can help
Models:
- ALBERT, BERT, XLM, DeBERTa, DeBERTa-v2, ELECTRA, MobileBert, SqueezeBert: @LysandreJik
Library:
- Pipelines: @Narsil
Model I am using (Bert, XLNet …): DistilBert, but I suspect this affects all Pipelines.
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
- I use the following script to predict on some random sentences:
from transformers import (
AutoModelForSequenceClassification,
AutoTokenizer,
TextClassificationPipeline,
)
def get_pipeline():
name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)
return TextClassificationPipeline(tokenizer=tokenizer, model=model)
sentence = ["hello", "goodbye"] * 100
model = get_pipeline()
- The results that I get are wildly different between Transformers 4.10 and 4.11+:
| Version | Command | Time |
|---|---|---|
| HF 4.12.0.dev0 | `%timeit -n 3 model(sentence)` | Does not complete after 10 minutes |
| HF 4.12.0.dev0 | `%timeit -n 3 model(sentence, num_workers=0)` | 4.67 s ± 153 ms per loop |
| HF 4.10.3 | `%timeit -n 3 model(sentence)` | 575 ms ± 10.8 ms per loop |
| HF 4.10.3 | `%timeit -n 3 model(sentence, num_workers=0)` | 500 ms ± 3.01 ms per loop |
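For reference, the same comparison can be timed outside IPython with the standard `timeit` module. A minimal sketch, assuming the `get_pipeline` helper from the script above:

```python
import timeit

model = get_pipeline()
sentence = ["hello", "goodbye"] * 100

# Time the default call (the path suspected to go through the new DataLoader
# in 4.11+) and the num_workers=0 variant, 3 runs each, mirroring the
# %timeit commands in the table above.
print(timeit.timeit(lambda: model(sentence), number=3))
print(timeit.timeit(lambda: model(sentence, num_workers=0), number=3))
```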
Expected behavior
I would expect the same performance if possible, or a way to bypass the PyTorch DataLoader.
Hi @alwayscurious,

Your example seems to be missing the `* 1000`, effectively killing the measuring. Is that just an omission, or does it change the results? The following assumes it actually modifies the results.

Running your example (with the `* 1000` added back), I get 100% GPU usage, but it takes 3mn to run the full thing, while it takes 35s on the pipeline example. GPU usage is not everything here 😃.

You can use `pipeline(sentences, batch_size=64)`. Increasing the size of the batch does yield improved speed pretty fast, and at some point it's not worth putting bigger batches (basically when you saturate the GPU). Then the full thing runs under 5s on my home GTX 970.

You are reading 100% GPU usage but much lower speed in your colab example because all your examples are padded to `512` max length, so the examples are effectively super large for the GPU (keeping it busy), but it's mostly doing useless work (hence 3mn instead of 35s).

The ~50% GPU utilization of the first example is because the example + model is a bit small, so not all of the GPU is required, meaning part of the GPU is idle. However, it's still running faster than the "old" example, because it's not wasting cycles on the padded tokens. If I remove the padding I fall back to roughly the `~35s` mentioned above. On larger models there would still probably be a difference linked to how the data is actually fed to the GPU, but that's out of scope for this discussion.
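To make the padding point concrete, here is a small sketch with the same DistilBERT tokenizer (the exact shapes are just illustrative):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
sentences = ["hello", "goodbye"] * 4

# Padding everything to the model max length: every row is 512 tokens wide,
# so the GPU stays busy but spends most of its time on pad tokens.
fixed = tok(sentences, padding="max_length", max_length=512, return_tensors="pt")
print(fixed["input_ids"].shape)    # torch.Size([8, 512])

# Dynamic padding: rows are only as wide as the longest sentence in the batch.
dynamic = tok(sentences, padding=True, return_tensors="pt")
print(dynamic["input_ids"].shape)  # much smaller, e.g. torch.Size([8, 4])
```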
By adding `pipeline(sentences, batch_size=64)` I am getting a `5s` runtime for the inference.

On a T4 you might be able to push the size of the batch even more. However, I always tell users to be careful: running on mock data and real data is likely to be different. By adding more to the batch, you risk getting OOM errors on live data that might be `max_seq_len` long, since then the whole batch can be much bigger. Even before OOM, if the data is highly irregular in terms of size, batching can hinder performance instead of helping it, because, just like in the notebook, it fills your batch with pad tokens. See this for the discussion: https://github.com/huggingface/transformers/blob/master/docs/source/main_classes/pipelines.rst
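For reference, a minimal sketch of the batched call; the checkpoint and batch size are just the ones discussed in this thread, tune them for your hardware:

```python
from transformers import pipeline

# Build the same sentiment pipeline discussed in this issue.
pipe = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

sentences = ["hello", "goodbye"] * 100

# batch_size groups several sentences per forward pass; larger batches are
# faster up to the point where the GPU is saturated (or runs out of memory).
results = pipe(sentences, batch_size=64)
print(results[0])
```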
Another knob you can turn is `pipeline(..., num_workers=10)`, which is the number of DataLoader workers used to feed the data to the GPU (it's DataLoader's argument) and might also help depending on your model/data configuration (the rule of thumb is num_workers = number of CPU cores).

Did I omit anything in this analysis?
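As a sketch, with the worker count taken from the rule of thumb above rather than a measured optimum:

```python
import os

from transformers import pipeline

pipe = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

# num_workers is forwarded to torch's DataLoader; num_workers=0 keeps
# everything in the main process, which is what the timings in the
# original report used.
results = pipe(["hello", "goodbye"] * 100, batch_size=64, num_workers=os.cpu_count())
```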
I’ll close the issue now that it is merged on master.
Cheer