
[Questions] Triton blocksparse flashattention & quantization


Hello @tridao ,

Congrats on the work on FlashAttention. It already seems to be having a huge impact, being integrated into kernl, BetterTransformer in PyTorch's nightlies, and I'm sure others.

I had two questions:

  • Are you aware of an implementation of blocksparse FlashAttention in OpenAI Triton? Are there any benchmarks available of the possible speedups / loss in prediction quality depending on the sparsity ratio of the mask matrices? Currently, as pointed out in another issue, it seems like DeepSpeed’s FixedSparsityConfig is used.
  • Is there any effort to implement flashattention / blocksparse flashattention with integer arithmetic (e.g. int8 GEMM)? Do you think it could be worthwhile throughput-wise?

Pinging @pommedeterresautee in case you work on these topics or have some insights!

Thanks a lot!


Top GitHub Comments

2 reactions
tridao commented, Nov 28, 2022

Congrats on the work on FlashAttention. It already seems to be having a huge impact, being integrated into kernl, BetterTransformer in PyTorch's nightlies, and I'm sure others.

Thanks for the kind words, we’ve been very happy to see FlashAttention being used in many places.

I had two questions:

  • Are you aware of an implementation of blocksparse FlashAttention in OpenAI Triton? Are there any benchmarks available of the possible speedups / loss in prediction quality depending on the sparsity ratio of the mask matrices? Currently, as pointed out in another issue, it seems like DeepSpeed’s FixedSparsityConfig is used.

I’m not aware of a blocksparse FlashAttention implementation in Triton, but that seems like a good idea! Triton already has a blocksparse matrix multiply implemented.

In terms of speedup, we observed a speedup of roughly 1 / density, i.e. inversely proportional to the fraction of nonzero blocks (e.g. if 20% of the blocks are nonzero, attention runs about 5x faster).

In terms of quality, it’s hard to say. For simpler tasks (e.g. Long Range Arena), block-sparse attention seems to do about as well as dense attention. For language modeling, the GPT-3 paper says they alternate dense and block-sparse attention, and I think GPT-J alternates dense and local (a form of sparse) attention. However, in general, sparse attention hasn’t been as widely used as dense attention. Maybe really long sequences could be a good use case?
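As a rough illustration of the speedup argument above (not the fused FlashAttention kernel itself), here is a minimal PyTorch sketch of block-sparse attention semantics: a block mask decides which (query-block, key-block) tiles are kept, and the expected speedup over dense attention is roughly 1 / density. The function name, block size, and random mask are made-up examples, not anything from the FlashAttention repo.

```python
import torch
import torch.nn.functional as F

def blocksparse_attention_reference(q, k, v, block_mask, block_size):
    """Reference (non-fused) block-sparse attention.

    q, k, v: (seq_len, head_dim)
    block_mask: (num_blocks, num_blocks) bool -- True keeps the
        (query-block, key-block) tile; False tiles are dropped.
    A fused kernel (e.g. in Triton) would simply never load or compute
    the masked-out tiles, which is where the ~1/density speedup comes from.
    """
    seq_len, head_dim = q.shape
    scale = head_dim ** -0.5
    scores = (q @ k.T) * scale                      # (seq_len, seq_len)

    # Expand the block mask to token resolution and mask out dropped tiles.
    token_mask = block_mask.repeat_interleave(block_size, dim=0) \
                           .repeat_interleave(block_size, dim=1)
    scores = scores.masked_fill(~token_mask, float("-inf"))

    attn = F.softmax(scores, dim=-1)
    return attn @ v

# Toy example: 512 tokens, 128-token blocks, keep ~20% of the tiles.
seq_len, head_dim, block_size = 512, 64, 128
num_blocks = seq_len // block_size
q, k, v = (torch.randn(seq_len, head_dim) for _ in range(3))
block_mask = torch.rand(num_blocks, num_blocks) < 0.2
block_mask.fill_diagonal_(True)                     # keep at least the diagonal blocks

out = blocksparse_attention_reference(q, k, v, block_mask, block_size)
density = block_mask.float().mean().item()
print(f"density={density:.2f}, expected speedup over dense ≈ {1 / density:.1f}x")
```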

  • Is there any effort to implement flashattention / blocksparse flashattention with integer arithmetic (e.g. int8 GEMM)? Do you think it could be worthwhile throughput-wise?

This is a good idea! I think it would make attention go up to twice as fast, since you would need to load 2x fewer bytes (for both global memory loads and shared memory loads).
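To make that bandwidth argument concrete, here is a hypothetical sketch of how Q and K could be quantized to int8 (symmetric, per-tensor) and the QK^T GEMM accumulated in int32 before dequantizing; storing Q and K in int8 moves half the bytes of fp16. This only illustrates the quantization arithmetic, not a fused kernel, and the helper name and shapes are illustrative.

```python
import torch

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: x ≈ scale * x_int8."""
    scale = x.abs().max().clamp(min=1e-8) / 127.0
    x_q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
    return x_q, scale

seq_len, head_dim = 256, 64
q = torch.randn(seq_len, head_dim)
k = torch.randn(seq_len, head_dim)

q_q, q_scale = quantize_int8(q)   # int8 storage: half the bytes of fp16
k_q, k_scale = quantize_int8(k)

# Integer GEMM with int32 accumulation (emulated here by casting to int32;
# a real kernel would use int8 tensor-core instructions).
scores_int32 = q_q.to(torch.int32) @ k_q.to(torch.int32).T

# Dequantize back to floating point before the softmax.
scores = scores_int32.to(torch.float32) * (q_scale * k_scale) * head_dim ** -0.5

# Compare against the fp32 reference.
ref = (q @ k.T) * head_dim ** -0.5
print("max abs error:", (scores - ref).abs().max().item())
```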

1 reaction
tridao commented, Dec 5, 2022

How should we maintain numerical stability while quantizing and dequantizing P in each step of FlashAttention? Normally, the softmax results would be quantized for the 2nd GEMM. Using fp16 for the 2nd GEMM might be an option, but I am unsure whether there is a way to keep int8 GEMMs for both QK and PV.

I haven’t thought too much about this, but it’s a general problem even if you implement it in PyTorch (you would have to decide how to quantize P). For training, I think fp16 could possibly work (I don’t know whether Nvidia’s Transformer Engine converts P to fp8 on H100). For inference, int8 could work, as done in FasterTransformer I think. In the end one would probably have to try it out to see.
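One hypothetical way to keep the PV product in int8, sketched below: because the softmax output P lies in [0, 1], it can be quantized with a fixed scale (1/127 here) without any data-dependent calibration, while V gets a per-tensor symmetric scale. Whether the accuracy holds up in a real model is exactly the "try it out and see" part; the function name and scaling choices are assumptions for illustration, and a fused kernel would have to fold these scales into FlashAttention's online-softmax rescaling.

```python
import torch

def int8_pv(p, v):
    """Quantize softmax probabilities P (values in [0, 1]) and V to int8,
    do the PV GEMM with int32 accumulation, then dequantize."""
    # P has a known range [0, 1], so a fixed scale of 1/127 works.
    p_q = (p * 127.0).round().to(torch.int8)

    # V needs a data-dependent (here per-tensor, symmetric) scale.
    v_scale = v.abs().max().clamp(min=1e-8) / 127.0
    v_q = torch.clamp((v / v_scale).round(), -127, 127).to(torch.int8)

    # Emulate the int8 GEMM with int32 accumulation, then undo both scales.
    out_int32 = p_q.to(torch.int32) @ v_q.to(torch.int32)
    return out_int32.to(torch.float32) * (v_scale / 127.0)

seq_len, head_dim = 256, 64
p = torch.softmax(torch.randn(seq_len, seq_len), dim=-1)
v = torch.randn(seq_len, head_dim)

approx = int8_pv(p, v)
ref = p @ v
print("max abs error:", (approx - ref).abs().max().item())
```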
