QuestionAnsweringPipeline query performance
This is my first issue posted here, so first off, thank you for building this library; it's really pushing NLP forward.
The current QuestionAnsweringPipeline relies on the method squad_convert_examples_to_features to convert question/context pairs to SquadFeatures. Reviewing this method, it looks like it spawns a process for each example.
This causes performance issues when trying to support near real-time or bulk queries. As a workaround, I can issue the queries directly against the model, but the pipeline has a lot of nice logic to format answers properly and to pull the best answer rather than just taking the start/end argmax.
Please see the results of a rudimentary performance test to demonstrate:
import time
from transformers import pipeline
context = r"""
The extractive question answering process took an average of 36.555 seconds using pipelines and about 2 seconds when
queried directly using the models.
"""
question = "How long did the process take?"
nlp = pipeline("question-answering", model="distilbert-base-cased-distilled-squad", tokenizer="distilbert-base-cased-distilled-squad")
start = time.time()
for x in range(100):
    answer = nlp(question=question, context=context)

print("Answer", answer)
print("Time", time.time() - start, "s")
Answer {'score': 0.8029816785368773, 'start': 62, 'end': 76, 'answer': '36.555 seconds'}
Time 36.703474044799805 s
import time

import torch
from transformers import pipeline, AutoModelForQuestionAnswering, AutoTokenizer
model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-cased-distilled-squad")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased-distilled-squad")
start = time.time()
for x in range(100):
    inputs = tokenizer.encode_plus(question, context, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]
    text_tokens = tokenizer.convert_ids_to_tokens(input_ids)

    answer_start_scores, answer_end_scores = model(**inputs)

    # Get the most likely beginning of the answer with the argmax of the score
    answer_start = torch.argmax(answer_start_scores)

    # Get the most likely end of the answer with the argmax of the score
    answer_end = torch.argmax(answer_end_scores) + 1

    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))

print("Answer", answer)
print("Time", time.time() - start, "s")
Answer 36 . 555 seconds
Time 2.1718859672546387 s
I believe the roughly 17x slowdown (36.7s vs. 2.2s) comes from the first test spawning a new process for each of the 100 calls. I also tried passing a list of 100 question/context pairs in a single call (roughly as in the sketch below) to see if that was faster; it took ~28s. But for this use case, all 100 questions wouldn't be available at once.
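For reference, that batch call is roughly the following, reusing nlp, question and context from the first test (the pipeline accepts lists for both arguments):

import time

# Single call with all 100 question/context pairs
start = time.time()
answers = nlp(question=[question] * 100, context=[context] * 100)
print("Time", time.time() - start, "s")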
The additional logic for answer extraction doesn't come for free, but it doesn't add much overhead. The third test below uses a custom pipeline component to demonstrate.
from cord19q.pipeline import Pipeline
pipeline = Pipeline("distilbert-base-cased-distilled-squad", False)
start = time.time()
for x in range(100):
    answer = pipeline([question], [context])

print("\nAnswer", answer)
print("Time", time.time() - start, "s")
Answer [{'answer': '36.555 seconds', 'score': 0.8029860216482803}]
Time 2.219379186630249 s
It would be great if the QuestionAnsweringPipeline could either skip the squad processor or if the processor were changed to accept an argument that disables process spawning.
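For anyone who wants to reproduce just the conversion overhead, the processor step can be timed in isolation with something like the sketch below (the argument values are assumptions based on common defaults, not necessarily what the pipeline uses; question and context as defined above):

import time

from transformers import AutoTokenizer, SquadExample, squad_convert_examples_to_features

# Slow tokenizer, since the squad processor works with the Python tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased-distilled-squad", use_fast=False)

start = time.time()
for x in range(100):
    # One example per call, mirroring how the pipeline is queried above
    example = SquadExample(
        qas_id="0",
        question_text=question,
        context_text=context,
        answer_text=None,
        start_position_character=None,
        title=None,
    )

    # Conversion step the pipeline goes through; as far as I can tell it sets up
    # a multiprocessing pool internally even for a single example
    features = squad_convert_examples_to_features(
        examples=[example],
        tokenizer=tokenizer,
        max_seq_length=384,
        doc_stride=128,
        max_query_length=64,
        is_training=False,
        threads=1,
    )

print("Conversion time", time.time() - start, "s")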
Top GitHub Comments
Hi @davidmezzetti, just to let you know we’re working towards a bigger pipeline refactor, with a strong focus on performance. Let’s keep this issue open while it’s still in the works in case more is to be said on the matter.
Glad to hear it!