
FillMaskPipeline very slow when provided with a large `targets`


Environment info

  • transformers version: 4.6.1
  • Platform: Linux-5.4.0-67-generic-x86_64-with-glibc2.10
  • Python version: 3.8.5
  • PyTorch version (GPU?): 1.8.1 (False)
  • Tensorflow version (GPU?): N/A
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help

@LysandreJik @Narsil

Information

The model I am using: ethanyt/guwenbert-base, a RoBERTa model paired with a BertTokenizerFast tokenizer.

To reproduce

Steps to reproduce the behavior:

  1. Initialize a fill-mask pipeline with the model and the tokenizer mentioned above
  2. Call it with any sentence and a large `targets` argument (a list of roughly 10k single-word targets), as in the sketch below
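
A minimal reproduction sketch; the masked sentence is a placeholder of mine, and the ~10k targets are simply drawn from the model's vocabulary rather than taken from the original report:

import time
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="ethanyt/guwenbert-base")

# Any sentence containing exactly one mask token will do.
sentence = fill_mask.tokenizer.mask_token + "风吹又生"
# ~10k single-token targets, taken straight from the vocabulary.
targets = fill_mask.tokenizer.convert_ids_to_tokens(list(range(10_000)))

start = time.perf_counter()
fill_mask(sentence)                   # baseline, no targets
print(f"without targets: {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
fill_mask(sentence, targets=targets)  # same call, large targets list
print(f"with targets:    {time.perf_counter() - start:.2f}s")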

Problem

Such a call is much slower than an equivalent call without a `targets` argument: a call without `targets` takes ~0.1s, while a call with `targets` takes ~0.3s.

The following code is present in src/transformers/pipelines/fill_mask.py:

from typing import Optional

class FillMaskPipeline(Pipeline):
    # ...
    def __call__(self, *args, targets=None, top_k: Optional[int] = None, **kwargs):
        # ...
        if targets is not None:
            # ...
            targets_proc = []
            for target in targets:
                # One tokenizer round-trip per target; only the first
                # resulting sub-token is kept.
                target_enc = self.tokenizer.tokenize(target)
                # ...
                targets_proc.append(target_enc[0])

This function iterates over targets and tokenizes each one individually, rather than passing the whole list to the tokenizer at once, so it never benefits from the batched (Rust-backed) encoding that fast tokenizers provide; hence the slow speed.
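
A rough side-by-side of the two approaches (my own sketch; it assumes the targets come from the model vocabulary and that every target encodes to at least one token):

import time
from transformers import AutoTokenizer

fast_tokenizer = AutoTokenizer.from_pretrained("ethanyt/guwenbert-base")
targets = fast_tokenizer.convert_ids_to_tokens(list(range(10_000)))

# Per-item loop, as in the pipeline excerpt above: one call per target.
start = time.perf_counter()
looped = [fast_tokenizer.tokenize(t)[0] for t in targets]
print(f"loop:  {time.perf_counter() - start:.2f}s")

# One batched call: the fast tokenizer encodes the whole list in a single
# parallelized pass and returns one id list per target.
start = time.perf_counter()
batched = fast_tokenizer(targets, add_special_tokens=False)["input_ids"]
first_tokens = [fast_tokenizer.convert_ids_to_tokens(ids[0]) for ids in batched]
print(f"batch: {time.perf_counter() - start:.2f}s")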

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

Narsil commented on Jun 11, 2021 (1 reaction)

I was able to reproduce this and optimize away most of the overhead; now any example should run at roughly the same speed.

A slowdown will still happen when a target misses the vocabulary, but the warnings should help users figure that out.
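
Judging from that comment, the optimization presumably looks up each target in the vocabulary first and only falls back to tokenizing on a miss. The following is my own sketch of that idea, not the actual patch; the helper name `target_ids` is hypothetical:

import warnings
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ethanyt/guwenbert-base")
vocab = tokenizer.get_vocab()  # built once; every lookup afterwards is O(1)

def target_ids(targets):
    ids = []
    for target in targets:
        token_id = vocab.get(target)
        if token_id is None:
            # Vocabulary miss: fall back to the slow tokenize path and keep
            # the first sub-token, as the original pipeline code did.
            warnings.warn(f"'{target}' is not in the vocab; using its first sub-token instead.")
            token_id = vocab[tokenizer.tokenize(target)[0]]
        ids.append(token_id)
    return ids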

EtaoinWu commented on Jun 11, 2021 (0 reactions)

Thanks a lot. As background, I found the issue while reproducing the following paper:

Deng, Liming, et al. “An Iterative Polishing Framework Based on Quality Aware Masked Language Model for Chinese Poetry Generation.” Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. No. 05. 2020.

which involves calling FillMaskPipeline iteratively, at most 10 times per API call; depending on the input, a call may or may not carry the `targets` parameter. The time difference between the two kinds of calls is what led me to this issue.
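
For illustration only, a rough sketch of that usage pattern; the `polish` helper, the one-position-per-round rule, and the candidate handling are placeholders of mine, not the paper's actual algorithm:

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="ethanyt/guwenbert-base")
mask = fill_mask.tokenizer.mask_token

def polish(line, candidates=None, max_rounds=10):
    # Re-predict one character per round, optionally restricting the
    # predictions to a candidate list via `targets`.
    for position in range(min(max_rounds, len(line))):
        masked = line[:position] + mask + line[position + 1:]
        if candidates is not None:
            best = fill_mask(masked, targets=candidates)[0]
        else:
            best = fill_mask(masked)[0]
        line = line[:position] + best["token_str"] + line[position + 1:]
    return line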
