Pipeline seems slower in 4.11+
Hello! When I upgraded Transformers, I got a massive slowdown. Might be related to the new DataLoader used in Pipeline.
Happy to help!
Cheers,
Environment info
- `transformers` version: 4.12.0.dev0
- Platform: macOS-10.16-x86_64-i386-64bit
- Python version: 3.8.12
- PyTorch version (GPU?): 1.9.1 (False)
- Tensorflow version (GPU?): 2.6.0 (False)
- Flax version (CPU?/GPU?/TPU?): 0.3.5 (cpu)
- Jax version: 0.2.24
- JaxLib version: 0.1.73
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: <fill in>
Who can help
Models:
- ALBERT, BERT, XLM, DeBERTa, DeBERTa-v2, ELECTRA, MobileBert, SqueezeBert: @LysandreJik
Library:
- Pipelines: @Narsil
Model I am using (Bert, XLNet …): DistilBert, but I suspect this affects all Pipelines.
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
- I use the following script to predict on some random sentences:
from transformers import (
AutoModelForSequenceClassification,
AutoTokenizer,
TextClassificationPipeline,
)
def get_pipeline():
name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)
return TextClassificationPipeline(tokenizer=tokenizer, model=model)
sentence = ["hello", "goodbye"] * 100
model = get_pipeline()
- The results that I get are wildly different between Transformers 4.10 and 4.11+:
| Version | Command | Time |
|---|---|---|
| HF 4.12.0.dev0 | `%timeit -n 3 model(sentence)` | Does not complete after 10 minutes |
| HF 4.12.0.dev0 | `%timeit -n 3 model(sentence, num_workers=0)` | 4.67 s ± 153 ms per loop |
| HF 4.10.3 | `%timeit -n 3 model(sentence)` | 575 ms ± 10.8 ms per loop |
| HF 4.10.3 | `%timeit -n 3 model(sentence, num_workers=0)` | 500 ms ± 3.01 ms per loop |
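For reference, the same comparison can be timed outside IPython with the standard `timeit` module. A minimal sketch, assuming the `get_pipeline` helper from the script above:

```python
import timeit

model = get_pipeline()
sentence = ["hello", "goodbye"] * 100

# Time the default call (the path suspected to go through the new DataLoader
# in 4.11+) and the num_workers=0 variant, 3 runs each, mirroring the
# %timeit commands in the table above.
print(timeit.timeit(lambda: model(sentence), number=3))
print(timeit.timeit(lambda: model(sentence, num_workers=0), number=3))
```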
Expected behavior
I would expect the same performance if possible, or a way to bypass the PyTorch DataLoader.
Hi @alwayscurious,

Your example seems to be missing the `* 1000`, effectively killing the measuring. Is that just an omission, or does it change the results? The following assumes it actually modifies the results.

Running your example (with the `* 1000` added back), I get 100% GPU usage, but it takes 3mn to run the full thing, while it takes 35s on the pipeline example. GPU usage is not everything here 😃.

You can use `pipeline(sentences, batch_size=64)`. Increasing the size of the batch does yield improved speed pretty fast, and at some point it's not worth putting bigger batches (basically when you saturate the GPU). Then the full thing runs under 5s on my home GTX 970.

You are reading 100% GPU usage but much lower speed in your colab example because all your examples are padded to `512` max length, so the examples are effectively super large for the GPU (keeping it busy), but it's mostly doing useless work (hence 3mn instead of 35s).

The ~50% GPU utilization of the first example is because the example + model is a bit small, so not all of the GPU is required, meaning part of the GPU is idle. However, it's still running faster than the "old" example, because it's not wasting cycles on the padded tokens. If I remove the padding I fall back to roughly the `~35s` mentioned above. On larger models there would still probably be a difference linked to how the data is actually fed to the GPU, but that's out of scope for this discussion.
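To make the padding point concrete, here is a small sketch with the same DistilBERT tokenizer (the exact shapes are just illustrative):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
sentences = ["hello", "goodbye"] * 4

# Padding everything to the model max length: every row is 512 tokens wide,
# so the GPU stays busy but spends most of its time on pad tokens.
fixed = tok(sentences, padding="max_length", max_length=512, return_tensors="pt")
print(fixed["input_ids"].shape)    # torch.Size([8, 512])

# Dynamic padding: rows are only as wide as the longest sentence in the batch.
dynamic = tok(sentences, padding=True, return_tensors="pt")
print(dynamic["input_ids"].shape)  # much smaller, e.g. torch.Size([8, 4])
```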
By adding `pipeline(sentences, batch_size=64)` I am getting a `5s` runtime for the inference.

On a T4 you might be able to push the size of the batch even more. However, I always tell users to be careful: running on mock data and real data is likely to be different. By adding more to the batch, you risk getting OOM errors on live data that might be `max_seq_len` long, since then the whole batch can be much bigger. Even before OOM, if the data is highly irregular in terms of size, batching can hinder performance instead of helping it, because, just like in the notebook, it fills your batch with pad tokens. See this for the discussion: https://github.com/huggingface/transformers/blob/master/docs/source/main_classes/pipelines.rst
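For reference, a minimal sketch of the batched call; the checkpoint and batch size are just the ones discussed in this thread, tune them for your hardware:

```python
from transformers import pipeline

# Build the same sentiment pipeline discussed in this issue.
pipe = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

sentences = ["hello", "goodbye"] * 100

# batch_size groups several sentences per forward pass; larger batches are
# faster up to the point where the GPU is saturated (or runs out of memory).
results = pipe(sentences, batch_size=64)
print(results[0])
```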
Another knob you can turn is `pipeline(..., num_workers=10)`, which is the number of DataLoader workers used to feed the data to the GPU (it's DataLoader's argument) and might also help depending on your model/data configuration (the rule of thumb is num_workers = number of CPU cores).

Did I omit anything in this analysis?
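As a sketch, with the worker count taken from the rule of thumb above rather than a measured optimum:

```python
import os

from transformers import pipeline

pipe = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

# num_workers is forwarded to torch's DataLoader; num_workers=0 keeps
# everything in the main process, which is what the timings in the
# original report used.
results = pipe(["hello", "goodbye"] * 100, batch_size=64, num_workers=os.cpu_count())
```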
I’ll close the issue now that it is merged on master.
Cheer