BART.generate: possible to reduce time/memory?

🐛 Performance issues

I did a quick benchmark between HuggingFace’s implementation of BART and FairSeq’s implementation.

You can find the benchmark code here.

Here are my results on a single GTX 1080 GPU (12 GiB of memory):

FP16 - Batch size 16	s/batch	s/sample
FairSeq	8.8676	0.5664
HuggingFace	12.3358	0.7879

FP16 - Batch size 32	s/batch	s/sample
FairSeq	17.1247	0.5469
HuggingFace	OOM	OOM

FP16 - Batch size 1	s/sample
FairSeq	1.6743
HuggingFace	1.8856

FP32 - Batch size 1	s/sample
FairSeq	1.7865
HuggingFace	2.0670

FairSeq is consistently faster than HuggingFace in all my experiments.
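
The linked benchmark code is not reproduced here. As a rough illustration only, the sketch below shows one way s/batch and s/sample figures like the ones above could be measured. The names `benchmark_generate`, `generate_fn`, and `sync_fn` are hypothetical and do not come from the original benchmark; `generate_fn` stands in for whichever model call is being timed (HuggingFace or FairSeq).

```python
import time
from typing import Callable, Dict, Sequence


def benchmark_generate(
    generate_fn: Callable[[Sequence[str]], object],
    samples: Sequence[str],
    batch_size: int,
    sync_fn: Callable[[], None] = lambda: None,
) -> Dict[str, float]:
    """Time generate_fn over samples, processed in batches of batch_size.

    generate_fn and sync_fn are placeholders (assumptions, not part of the
    original benchmark). On a GPU, sync_fn could be torch.cuda.synchronize
    so that asynchronous kernels are finished before the clock is read.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if len(samples) == 0:
        raise ValueError("samples must not be empty")

    # Split the inputs into consecutive batches; the last one may be smaller.
    batches = [
        samples[i:i + batch_size] for i in range(0, len(samples), batch_size)
    ]

    sync_fn()  # make sure no earlier work is still pending
    start = time.perf_counter()
    for batch in batches:
        generate_fn(batch)
    sync_fn()  # wait for the last batch to actually finish
    elapsed = time.perf_counter() - start

    return {
        "n_batches": len(batches),
        "n_samples": len(samples),
        "s_per_batch": elapsed / len(batches),
        "s_per_sample": elapsed / len(samples),
    }
```

For a fair comparison, both implementations would be run on the same inputs, with the same generation settings and the same precision, after a warm-up batch that is excluded from the timing.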

This raises a few questions:

Do you get similar results on your side? Did I mess up my benchmark?
Why is HuggingFace’s implementation significantly slower?
Why does HuggingFace’s implementation use more memory (as shown by the OOM at batch size 32)?
Is the release of the Summarization Pipeline going to improve this?

@sshleifer

Top GitHub Comments

sshleifer commented, Mar 30, 2020

On master, the gap has closed considerably! Less than 16 GB of GPU RAM for fp16 at bs=32, and the timings are much closer.

My numbers are a bit lower than yours because I am on an NVIDIA RTX GPU.

sshleifer commented, Mar 6, 2020

Identical to my benchmark for speed. I hadn’t tested memory, but I’m not surprised that their implementation uses less.

For both memory and speed, they have a lot of clever tricks that we haven’t implemented yet.

Summarization Pipeline will not help, but I will take a longer look at this tomorrow and see if we can improve.
