[Questions] Triton blocksparse flashattention & quantization
Hello @tridao,
Congrats on the work on FlashAttention. It already seems to be having a huge impact, being integrated into Kernl, PyTorch's nightly BetterTransformer, and I'm sure others.
I had two questions:
- Are you aware of an implementation of blocksparse FlashAttention in OpenAI Triton? Are there any benchmarks available of the possible speedups / loss in prediction quality depending on the sparsity ratio of the mask matrices? Currently, as pointed out in another issue, it seems like DeepSpeed's FixedSparsityConfig is used (a sketch of such a layout follows below).
- Is there any effort to implement FlashAttention / blocksparse FlashAttention with integer arithmetic (e.g. int8 GEMM)? Do you think it could be worthwhile throughput-wise?
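For context, a rough sketch of the kind of block-level layout such a fixed sparsity pattern produces (plain PyTorch; the block size, window size, and the fixed_layout helper name are illustrative assumptions, not DeepSpeed's actual API):

```python
import torch

def fixed_layout(seq_len: int, block: int = 64,
                 num_local_blocks: int = 4, num_global_blocks: int = 1):
    """Illustrative block-level layout: entry [i, j] = 1 means query block i
    attends to key block j, 0 means that block pair is skipped.

    Local causal window plus a few global key blocks, in the spirit of a
    'fixed' pattern; not DeepSpeed's actual implementation.
    """
    n_blocks = seq_len // block
    layout = torch.zeros(n_blocks, n_blocks, dtype=torch.int64)
    for i in range(n_blocks):
        lo = max(0, i - num_local_blocks + 1)
        layout[i, lo:i + 1] = 1            # local (causal) window of blocks
    layout[:, :num_global_blocks] = 1      # global key blocks every query sees
    return layout

layout = fixed_layout(seq_len=2048, block=64)
density = layout.float().mean().item()
print(f"density = {density:.1%}, ideal speedup ≈ {1 / density:.1f}x over dense")
```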
Pinging @pommedeterresautee in case you work on these topics or have some insights!
Thanks a lot!
Issue Analytics: created 10 months ago, 5 comments (3 by maintainers)

Thanks for the kind words; we've been very happy to see FlashAttention being used in many places.
I'm not aware of a blocksparse FlashAttention implementation in Triton, but that seems like a good idea! Triton does have blocksparse matrix multiply implemented.
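To make the block structure concrete, here is a plain-PyTorch reference of what a block-sparse attention kernel computes: only (query-block, key-block) pairs whose layout entry is 1 contribute to the output. This is a readability sketch of the semantics (it still materializes the full score matrix and masks it), not a Triton kernel and not the tiled FlashAttention algorithm; all names and sizes are illustrative.

```python
import math
import torch

def blocksparse_attention_reference(q, k, v, layout, block):
    """Dense reference for block-sparse attention.

    q, k, v: (seq_len, head_dim); layout: (n_blocks, n_blocks) with 1 for block
    pairs that are kept. Blocks with layout == 0 are masked to -inf before the
    softmax, so they get zero probability. A real kernel would skip those
    blocks entirely instead of materializing the full score matrix.
    """
    seq_len, head_dim = q.shape
    scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)
    # Expand the block-level layout into an element-level mask.
    mask = layout.repeat_interleave(block, dim=0).repeat_interleave(block, dim=1)
    scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Toy usage: random 0/1 block layout with ~30% density; diagonal blocks are
# kept so every query attends to at least one block.
seq_len, head_dim, block = 256, 64, 32
q, k, v = torch.randn(3, seq_len, head_dim).unbind(0)
layout = (torch.rand(seq_len // block, seq_len // block) < 0.3).long()
layout.fill_diagonal_(1)
print(f"block density: {layout.float().mean().item():.0%}")
out = blocksparse_attention_reference(q, k, v, layout, block)
print(out.shape)  # torch.Size([256, 64])
```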
In terms of speedup, we observed a speedup roughly inversely proportional to the block density (e.g. if only 20% of the blocks are nonzero, attention runs about 5x faster).
In terms of quality, it's hard to say. For simpler tasks (e.g. Long Range Arena), block-sparse attention seems to do about as well as dense attention. For language modeling, the GPT-3 paper says they alternate dense and block-sparse attention, and I think GPT-J alternates dense and local (a form of sparse) attention. However, in general, sparse attention hasn't been as widely used as dense attention. Maybe really long sequences could be a good use case?
This is a good idea! I think it would make attention go about twice as fast, since you would need to load half as many bytes (for both global memory and shared memory loads).
I haven't thought too much about this, but it's a general problem even if you implement it in PyTorch (you would have to decide how to quantize P). For training, I think fp16 could possibly work (I don't know whether Nvidia's Transformer Engine converts P to fp8 on H100). For inference, int8 could work, as is done in FasterTransformer I think. In the end, one would probably have to try it out to see.
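As a rough illustration of the int8 direction for the Q·Kᵀ part: the sketch below uses per-tensor symmetric quantization and keeps the softmax (and the P·V product) in floating point, which sidesteps the question of quantizing P. The scale choice and the split between integer and float arithmetic are assumptions for this sketch, not how FasterTransformer or Transformer Engine actually do it.

```python
import math
import torch

def quantize_symmetric(x: torch.Tensor):
    """Per-tensor symmetric int8 quantization: x ≈ scale * x_int8."""
    scale = x.abs().max() / 127.0
    x_int8 = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return x_int8, scale

def int8_attention_probs(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """softmax(Q K^T / sqrt(d)) with Q and K quantized to int8.

    The integer GEMM is emulated with int64 on CPU for portability; a GPU
    kernel would run an int8 GEMM accumulating into int32. Softmax stays in
    floating point, so P itself is not quantized in this sketch.
    """
    q_i8, q_scale = quantize_symmetric(q)
    k_i8, k_scale = quantize_symmetric(k)
    scores_int = q_i8.long() @ k_i8.long().transpose(-2, -1)
    scores = scores_int.float() * (q_scale * k_scale) / math.sqrt(q.shape[-1])
    return torch.softmax(scores, dim=-1)

# Compare against the fp32 reference to gauge the quantization error.
q, k = torch.randn(128, 64), torch.randn(128, 64)
p_ref = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(64), dim=-1)
p_i8 = int8_attention_probs(q, k)
print((p_ref - p_i8).abs().max())  # quantization error; small for well-scaled inputs
```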