QuestionAnsweringPipeline query performance
This is my first issue posted here, so first off, thank you for building this library; it's really pushing NLP forward.
The current QuestionAnsweringPipeline relies on the method squad_convert_examples_to_features to convert question/context pairs to SquadFeatures. Reviewing this method, it looks like it spawns a process for each example.
This causes performance issues when trying to support near real-time or bulk queries. As a workaround, I can issue the queries directly against the model, but the pipeline has a lot of nice logic to format answers properly and to pull the best answer rather than just taking the start/end argmax.
Please see the results of a rudimentary performance test to demonstrate:
import time
from transformers import pipeline
context = r"""
The extractive question answering process took an average of 36.555 seconds using pipelines and about 2 seconds when
queried directly using the models.
"""
question = "How long did the process take?"
nlp = pipeline("question-answering", model="distilbert-base-cased-distilled-squad", tokenizer="distilbert-base-cased-distilled-squad")
start = time.time()
for x in range(100):
    answer = nlp(question=question, context=context)

print("Answer", answer)
print("Time", time.time() - start, "s")
Answer {'score': 0.8029816785368773, 'start': 62, 'end': 76, 'answer': '36.555 seconds'}
Time 36.703474044799805 s
import time

import torch
from transformers import pipeline, AutoModelForQuestionAnswering, AutoTokenizer
model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-cased-distilled-squad")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased-distilled-squad")
start = time.time()
for x in range(100):
    inputs = tokenizer.encode_plus(question, context, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]
    text_tokens = tokenizer.convert_ids_to_tokens(input_ids)

    answer_start_scores, answer_end_scores = model(**inputs)

    # Get the most likely beginning of the answer with the argmax of the score
    answer_start = torch.argmax(answer_start_scores)

    # Get the most likely end of the answer with the argmax of the score
    answer_end = torch.argmax(answer_end_scores) + 1

    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))

print("Answer", answer)
print("Time", time.time() - start, "s")
Answer 36 . 555 seconds
Time 2.1718859672546387 s
I believe the roughly 17x slowdown (36.7s vs. 2.2s) comes from the first test spawning a new process for each of the 100 calls. I also tried passing a list of 100 question/context pairs in a single call (roughly as in the sketch below) to see if that was faster; it took ~28s. But for this use case, all 100 questions wouldn't be available at once.
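For reference, that batch call is roughly the following, reusing nlp, question and context from the first test (the pipeline accepts lists for both arguments):

import time

# Single call with all 100 question/context pairs
start = time.time()
answers = nlp(question=[question] * 100, context=[context] * 100)
print("Time", time.time() - start, "s")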
The additional logic for answer extraction doesn't come for free, but it doesn't add much overhead. The third test below uses a custom pipeline component to demonstrate.
from cord19q.pipeline import Pipeline
pipeline = Pipeline("distilbert-base-cased-distilled-squad", False)
start = time.time()
for x in range(100):
    answer = pipeline([question], [context])

print("\nAnswer", answer)
print("Time", time.time() - start, "s")
Answer [{'answer': '36.555 seconds', 'score': 0.8029860216482803}]
Time 2.219379186630249 s
It would be great if the QuestionAnsweringPipeline could either skip the squad processor or if the processor were changed to accept an argument that disables process spawning.
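For anyone who wants to reproduce just the conversion overhead, the processor step can be timed in isolation with something like the sketch below (the argument values are assumptions based on common defaults, not necessarily what the pipeline uses; question and context as defined above):

import time

from transformers import AutoTokenizer, SquadExample, squad_convert_examples_to_features

# Slow tokenizer, since the squad processor works with the Python tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased-distilled-squad", use_fast=False)

start = time.time()
for x in range(100):
    # One example per call, mirroring how the pipeline is queried above
    example = SquadExample(
        qas_id="0",
        question_text=question,
        context_text=context,
        answer_text=None,
        start_position_character=None,
        title=None,
    )

    # Conversion step the pipeline goes through; as far as I can tell it sets up
    # a multiprocessing pool internally even for a single example
    features = squad_convert_examples_to_features(
        examples=[example],
        tokenizer=tokenizer,
        max_seq_length=384,
        doc_stride=128,
        max_query_length=64,
        is_training=False,
        threads=1,
    )

print("Conversion time", time.time() - start, "s")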
Top GitHub Comments
Hi @davidmezzetti, just to let you know we’re working towards a bigger pipeline refactor, with a strong focus on performance. Let’s keep this issue open while it’s still in the works in case more is to be said on the matter.
Glad to hear it!