
Pipeline seems slower in 4.11+


Hello! When I upgraded Transformers, I got a massive slowdown. Might be related to the new DataLoader used in Pipeline.

Happy to help!

Cheers,

Environment info

  • transformers version: 4.12.0.dev0
  • Platform: macOS-10.16-x86_64-i386-64bit
  • Python version: 3.8.12
  • PyTorch version (GPU?): 1.9.1 (False)
  • Tensorflow version (GPU?): 2.6.0 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.3.5 (cpu)
  • Jax version: 0.2.24
  • JaxLib version: 0.1.73
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: <fill in>

Who can help

Models:

  • ALBERT, BERT, XLM, DeBERTa, DeBERTa-v2, ELECTRA, MobileBert, SqueezeBert: @LysandreJik

Model I am using (Bert, XLNet …): DistilBert, but I suspect this affects all pipelines.

The problem arises when using:

  • my own modified scripts (see the reproduction script below)

The task I am working on is:

  • my own task or dataset (predicting on random sentences, see below)

To reproduce

Steps to reproduce the behavior:

  1. I use the following script to predict on some random sentences:
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TextClassificationPipeline,
)


def get_pipeline():
    name = "distilbert-base-uncased-finetuned-sst-2-english"
    model = AutoModelForSequenceClassification.from_pretrained(name)
    tokenizer = AutoTokenizer.from_pretrained(name)
    return TextClassificationPipeline(tokenizer=tokenizer, model=model)


sentence = ["hello", "goodbye"] * 100
model = get_pipeline()
  2. The results I get are wildly different between Transformers 4.10 and 4.11+:
Version        | Command                                      | Time
HF 4.12.0.dev0 | %timeit -n 3 model(sentence)                 | does not complete after 10 minutes
HF 4.12.0.dev0 | %timeit -n 3 model(sentence, num_workers=0)  | 4.67 s ± 153 ms per loop
HF 4.10.3      | %timeit -n 3 model(sentence)                 | 575 ms ± 10.8 ms per loop
HF 4.10.3      | %timeit -n 3 model(sentence, num_workers=0)  | 500 ms ± 3.01 ms per loop
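
For anyone not using IPython's %timeit, a minimal sketch of an equivalent measurement (the warm-up call and perf_counter timing are assumptions added here, not part of the original report):

import time
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TextClassificationPipeline,
)

name = "distilbert-base-uncased-finetuned-sst-2-english"
pipe = TextClassificationPipeline(
    model=AutoModelForSequenceClassification.from_pretrained(name),
    tokenizer=AutoTokenizer.from_pretrained(name),
)
sentence = ["hello", "goodbye"] * 100

pipe(sentence, num_workers=0)  # warm-up: the first call pays one-time setup costs

start = time.perf_counter()
pipe(sentence, num_workers=0)
print(f"one pass: {time.perf_counter() - start:.2f} s")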

Expected behavior

I would expect the same performance if possible, or a way to bypass the PyTorch DataLoader.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 9 (3 by maintainers)

Top GitHub Comments

1 reaction
Narsil commented on Nov 1, 2021

Hi @alwayscurious ,

  1. The linked notebook does not have * 1000, which effectively invalidates the measurement. Is that just an omission, or does it change the results? The following assumes it actually changes the results.
  2. In my modified version of your test (I used the same model as the pipeline example, with * 1000 added back), I get 100% GPU usage, but it takes 3 min to run the full thing, while the pipeline example takes 35 s. GPU usage is not everything here 😃.
  3. You are perfectly correct that the GPU is underused with the pipeline example; on master transformers we can push it with pipeline(sentences, batch_size=64). Increasing the batch size improves speed quickly at first, but at some point bigger batches stop being worth it (basically once you saturate the GPU). With that change, the full thing runs in under 5 s on my home GTX 970.

You are seeing 100% GPU usage but much lower speed in your Colab example because all your examples are padded to the 512 max length, so the examples are effectively very large for the GPU (keeping it busy), but it is mostly doing useless work (hence 3 min instead of 35 s).

The ~50% GPU utilization of the first example is because the example + model is a bit small, so not all of the GPU is required, meaning part of it sits idle. However, it still runs faster than the “old” example because it is not wasting cycles on padded tokens. If I remove the padding, I fall back to roughly the ~35 s mentioned above. On larger models there would probably still be a difference linked to how the data is fed to the GPU, but that is out of scope for this discussion.
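
To make the padding effect concrete, a small sketch (the model name and sample sentences are illustrative, not from the thread):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Padding everything to a fixed max_length keeps the GPU busy on pad tokens:
fixed = tok(["hello"], padding="max_length", max_length=512, return_tensors="pt")
print(fixed["input_ids"].shape)  # torch.Size([1, 512])

# Dynamic padding only pads to the longest sequence in the batch:
dynamic = tok(["hello", "a slightly longer sentence"], padding=True, return_tensors="pt")
print(dynamic["input_ids"].shape)  # e.g. torch.Size([2, 6])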

By adding pipeline(sentences, batch_size=64), I get a 5 s runtime for the inference.
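
A minimal sketch of that call, assuming the same sentiment model as the original report (the task string and input data are illustrative):

from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
sentences = ["hello", "goodbye"] * 1000

# batch_size groups examples per forward pass; 64 is the value from the comment,
# but the sweet spot depends on model size and GPU memory.
results = classifier(sentences, batch_size=64)
print(results[0])  # e.g. {'label': 'POSITIVE', 'score': ...}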

On a T4 you might be able to push the batch size even further. However, I always tell users to be careful: mock data and real data are likely to behave differently. By adding more to the batch, you risk OOM errors on live data, where examples can be max_seq_len long, making the whole batch much bigger. Even before OOM, if the data is highly irregular in size, batching can hinder performance instead of helping it because, just like in the notebook, it fills your batch with pad tokens. See the discussion here: https://github.com/huggingface/transformers/blob/master/docs/source/main_classes/pipelines.rst.

Another knob you can turn is pipeline(..., num_workers=10), which is the number of worker processes used to feed data to the GPU (it is DataLoader’s argument) and might also help depending on your model/data configuration (the rule of thumb is num_workers = number of CPU cores).
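
Continuing the sketch above, both knobs can be combined (the values are illustrative and workload-dependent):

# Reuses `classifier` and `sentences` from the previous sketch.
results = classifier(sentences, batch_size=64, num_workers=10)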

Did I omit anything in this analysis?

0 reactions
Dref360 commented on Nov 1, 2021

I’ll close the issue now that the fix is merged on master.

Cheers
