
Hello,

Thank you for your awesome work! Flash attention is going to be used everywhere!

I have a few questions, please:

  1. To use flash attention in an existing PyTorch transformer, it suffices to replace torch.nn.MultiheadAttention with flash_attn.flash_attention.FlashMHA, is that correct?
  2. Training is also supported out of the box, I guess? This also covers mixed-precision training, i.e., compatibility with the torch.autocast() context manager.
  3. I see that you also provide a Fused Softmax implementation. According to the docstrings, this layer is used for auto-regressive models. If I only use the transformer encoder, e.g., vision transformers, then it’s not worth using it. Is that correct?

Thank you very much in advance for your answers.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 1
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

tridao commented, Nov 5, 2022 (1 reaction)
  1. To use flash attention in an existing PyTorch transformer, it suffices to replace torch.nn.MultiheadAttention with flash_attn.flash_attention.FlashMHA, is that correct?

Yes. The two modules do the same thing, though they have different APIs and arguments. You should read the arguments and documentation to make sure you pass in the right things.
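
For reference, a minimal sketch of such a swap. The FlashMHA constructor arguments and return value shown here follow the flash_attn package around the time of this issue and may have changed, so treat them as assumptions and check the source:

```python
import torch
from flash_attn.flash_attention import FlashMHA

# Before: standard PyTorch attention (supports cross-attention, fp32, etc.)
# mha = torch.nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

# After: FlashAttention self-attention module. Note the API differences:
# FlashMHA is self-attention only and takes a single (batch, seqlen, embed_dim)
# input rather than separate query/key/value tensors.
mha = FlashMHA(embed_dim=512, num_heads=8, device="cuda", dtype=torch.float16)

x = torch.randn(4, 128, 512, device="cuda", dtype=torch.float16)
out, attn_weights = mha(x)  # attn_weights is None unless explicitly requested
```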

  2. Training is also supported out of the box, I guess? This also covers mixed-precision training, i.e., compatibility with the torch.autocast() context manager.

Yes, training works, and so does mixed-precision training. torch.autocast() will do the right thing and cast q, k, and v to either fp16 or bf16. FlashAttention does not support fp32.
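
A minimal sketch of what such a training step could look like, assuming a toy model built around FlashMHA and a hypothetical `loader` yielding (x, target) batches:

```python
import torch
from flash_attn.flash_attention import FlashMHA

class TinyEncoder(torch.nn.Module):
    """Toy classifier around FlashMHA, for illustration only."""
    def __init__(self, dim=512, heads=8, num_classes=10):
        super().__init__()
        self.mha = FlashMHA(embed_dim=dim, num_heads=heads)
        self.head = torch.nn.Linear(dim, num_classes)

    def forward(self, x):
        out, _ = self.mha(x)
        return self.head(out.mean(dim=1))  # mean-pool over the sequence

model = TinyEncoder().cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

for x, target in loader:  # `loader` is a hypothetical DataLoader
    optimizer.zero_grad()
    # Inside autocast, the projections run in fp16, so the FlashAttention
    # kernel receives fp16 q, k, v while the master weights stay in fp32.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.cross_entropy(model(x.cuda()), target.cuda())
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```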

  3. I see that you also provide a Fused Softmax implementation. According to the docstrings, this layer is used for auto-regressive models. If I only use the transformer encoder, e.g., vision transformers, then it’s not worth using it. Is that correct?

The Fused Softmax was taken from apex/megatron purely for benchmarking. They’re only useful if you have either a causal mask before the softmax (e.g., autoregressive models) or a key padding mask before the softmax (e.g., BERT, where sequences in a batch have different lengths). If you’re using a transformer encoder and all sequences in the batch have the same length, then they won’t apply to your case.
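
For concreteness, here is what those two mask types look like in plain PyTorch (this is independent of the fused kernels themselves):

```python
import torch

seqlen = 5

# Causal mask for autoregressive models: position i may only attend to j <= i.
causal_mask = torch.tril(torch.ones(seqlen, seqlen, dtype=torch.bool))

# Key padding mask for variable-length batches (e.g., BERT): True marks a
# real token, False marks padding. Here the second sequence has length 3.
lengths = torch.tensor([5, 3])
key_padding_mask = torch.arange(seqlen)[None, :] < lengths[:, None]

# An encoder over fixed-length sequences needs neither mask, which is why
# the fused softmax kernels don't apply in that case.
```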

netw0rkf10w commented, Nov 18, 2022 (0 reactions)

@tridao Great, thanks! Will create a PR soon!
