Batch inference runtime slows down for inputs with sentences of different lengths
Environment info
- `transformers` version: 4.6.1
- Platform: Ubuntu 18.04.5 LTS
- Python version: 3.6.9
- PyTorch version (GPU?): 1.8.1
- Tensorflow version (GPU?):
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
Who can help
Information
Model I am using (Bert, XLNet …): LukeForEntityPairClassification
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
- generate batched inputs for the LukeTokenizer with identical sentences in each batch (i.e. no padding required)
- tokenize each batch by passing the batch to the tokenizer
- run inference on each batch on GPU and notice that runtime is the same for each batch
- generate batched inputs for the LukeTokenizer with sentences of different length in each batch (i.e. padding is required)
- tokenize each batch by passing the batch to the tokenizer with `padding=True`
- run inference on each batch on GPU and notice that the runtime increases substantially for every batch after the first
Reproduction script:

```python
import torch
from transformers import LukeForEntityPairClassification, LukeTokenizer
import time

text1 = "Beyoncé lives in Los Angeles."
entity_spans1 = [(0, 7), (17, 28)]
text2 = "Kevin Love has urged the Cleveland Cavaliers to fight to regain their form following LeBron James' move to the Los Angeles Lakers."
entity_spans2 = [(85, 97), (111, 129)]

# experiment 1 - sentence length is identical across the full batch
text = [[text1] * 10, [text2] * 10]
entity_spans = [[entity_spans1] * 10, [entity_spans2] * 10]

model = LukeForEntityPairClassification.from_pretrained("studio-ousia/luke-large-finetuned-tacred")
tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-large-finetuned-tacred")

tokenized_inputs = []
for text_batch, span_batch in zip(text, entity_spans):
    inputs = tokenizer(text_batch, entity_spans=span_batch, return_tensors="pt", padding=True, truncation=True)
    tokenized_inputs.append(inputs)

device = torch.device('cuda')
model.to(device)
model.eval()

for i, batch in enumerate(tokenized_inputs):
    with torch.no_grad():
        start = time.time()
        batch.to(device)
        outputs = model(**batch)
        print(f"runtime batch {i}: ", time.time() - start)

# experiment 2 - sentence length alternates across the batch
text = [[text1, text2] * 10] * 2
entity_spans = [[entity_spans1, entity_spans2] * 10] * 2

model = LukeForEntityPairClassification.from_pretrained("studio-ousia/luke-large-finetuned-tacred")
tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-large-finetuned-tacred")

tokenized_inputs = []
for text_batch, span_batch in zip(text, entity_spans):
    inputs = tokenizer(text_batch, entity_spans=span_batch, return_tensors="pt", padding=True, truncation=True)
    tokenized_inputs.append(inputs)

device = torch.device('cuda')
model.to(device)
model.eval()

for i, batch in enumerate(tokenized_inputs):
    with torch.no_grad():
        start = time.time()
        batch.to(device)
        outputs = model(**batch)
        print(f"runtime batch {i}: ", time.time() - start)
```
Results (Tesla T4):

```
# experiment 1
runtime batch 0: 0.028860092163085938
runtime batch 1: 0.03273129463195801

# experiment 2
runtime batch 0: 0.028328895568847656
runtime batch 1: 0.09934639930725098
```
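One thing worth checking before comparing the two experiments: the padded batches are not the same size. In experiment 1 each batch holds 10 identical sentences, while in experiment 2 each batch holds 20 sentences padded to the length of the longer one, so the per-batch compute differs. A quick sanity check is just to print the shapes of whatever tensors the tokenizer returned:

```python
# Inspect the padded shapes of each tokenized batch
# (uses the `tokenized_inputs` list built in the script above).
for i, batch in enumerate(tokenized_inputs):
    shapes = {name: tuple(tensor.shape) for name, tensor in batch.items()}
    print(f"batch {i}: {shapes}")
```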
Expected behavior
I expect the runtime to be the same when an identical batch of inputs is run a second time (as in experiment 2, where batch 0 and batch 1 contain exactly the same sentences).
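For what it's worth, there may be a confound in the timing itself (an assumption on my part, not verified for LUKE specifically): CUDA kernels are launched asynchronously, so `time.time()` read right after `model(**batch)` can return before the GPU has finished, and the remaining work then shows up in the measurement of the next batch. A minimal variant of the timing loop that synchronizes before reading the clock:

```python
import time
import torch

# Same loop as above, but synchronize the GPU so each measurement covers
# only the work of its own batch. Assumes `model`, `device`, and
# `tokenized_inputs` are defined as in the reproduction script.
for i, batch in enumerate(tokenized_inputs):
    batch.to(device)
    with torch.no_grad():
        torch.cuda.synchronize()
        start = time.time()
        outputs = model(**batch)
        torch.cuda.synchronize()
        elapsed = time.time() - start
    print(f"synchronized runtime batch {i}: {elapsed:.4f} s")
```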
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Pinging @NielsRogge as he might have an idea of what’s going on with LUKE
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.