parallel execution of read/write operations (paper vs implementation)
In Algorithm 1 of FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, the inner loop starts by loading a tile of the tensor O from HBM into SRAM as Oi. It then performs some computations and writes Oi from SRAM back into the HBM tensor O. For each iteration of the outer loop, the inner loop does a full pass, so there is a full read/write of the O tensor.
If the outer loop (j) is executed serially, it should provide the correct result.
But when we execute the outer loop in parallel, it seems that we would have concurrent read and write accesses to O.
Therefore, if we strictly follow Algorithm 1 and "just" parallelize the outer loop, the output should not be correct.
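To make the access pattern concrete, here is a minimal Python sketch of the paper's loop order. It is schematic only: the running-max/denominator rescaling of the real Algorithm 1 is omitted and plain exp is used, so it does not compute correct attention; it is just meant to show which tiles are read and written.

```python
import numpy as np

def flash_attention_paper_order(Q, K, V, Br, Bc):
    # Schematic of the paper's loop order; softmax rescaling omitted.
    N, d = Q.shape
    O = np.zeros((N, d))
    Tr, Tc = N // Br, N // Bc
    for j in range(Tc):                          # outer loop over K/V blocks
        Kj = K[j * Bc:(j + 1) * Bc]              # load K_j, V_j once per j
        Vj = V[j * Bc:(j + 1) * Bc]
        for i in range(Tr):                      # inner loop over Q/O blocks
            Qi = Q[i * Br:(i + 1) * Br]
            Oi = O[i * Br:(i + 1) * Br].copy()   # READ O_i from "HBM"
            S = Qi @ Kj.T
            Oi = Oi + np.exp(S) @ Vj             # (rescaling omitted)
            O[i * Br:(i + 1) * Br] = Oi          # WRITE O_i back to "HBM"
            # If two values of j run concurrently, this read-modify-write
            # of the same O_i is a data race.
    return O
```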
In the Triton implementation, they have swapped the inner and outer loops, so they don't need to load Oi: they start with a zeroed accumulator, do all the computation, and write the final result to HBM. Therefore they don't have any concurrent-access issue on this variable.
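For contrast, a sketch of the swapped (Triton-style) loop order under the same simplifications as above:

```python
import numpy as np

def flash_attention_swapped_order(Q, K, V, Br, Bc):
    # Schematic of the swapped loop order; softmax rescaling omitted.
    N, d = Q.shape
    O = np.zeros((N, d))
    Tr, Tc = N // Br, N // Bc
    for i in range(Tr):                      # outer loop over Q/O blocks
        Qi = Q[i * Br:(i + 1) * Br]
        Oi = np.zeros((Br, d))               # start from a zeroed accumulator
        for j in range(Tc):                  # inner loop over K/V blocks
            Kj = K[j * Bc:(j + 1) * Bc]      # K_j, V_j reloaded for every i
            Vj = V[j * Bc:(j + 1) * Bc]
            S = Qi @ Kj.T
            Oi = Oi + np.exp(S) @ Vj         # (rescaling omitted)
        O[i * Br:(i + 1) * Br] = Oi          # single write per O_i, no race
    return O
```

Because each O_i is produced by exactly one outer iteration, parallelizing over i is safe.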
Can you please clarify how the parallelization is implemented in CUDA? Is there something missing in the description above that would explain why Algorithm 1 works correctly in parallel?

Hi tridao, sorry for bothering you in a closed issue again. As Triton swaps the order of the loops (the original order in the paper vs. the swapped one in Triton), it seems that the loading of K/V increases to T_r times for the Triton version?
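For concreteness, a rough tally of block loads for the two orderings (the values of T_r and T_c below are arbitrary placeholders, and only O traffic is counted on top of Q/K/V):

```python
# Rough tally of tile loads from HBM for the two loop orders.
Tr, Tc = 64, 16   # hypothetical numbers of Q/O blocks and K/V blocks

paper_order = {
    "K/V block loads": Tc,          # each K_j, V_j loaded once
    "Q block loads": Tr * Tc,       # each Q_i reloaded for every j
    "O block reads+writes": Tr * Tc,
}
swapped_order = {
    "K/V block loads": Tr * Tc,     # each K_j, V_j loaded T_r times
    "Q block loads": Tr,            # each Q_i loaded once
    "O block writes": Tr,           # each O_i written once
}
print(paper_order)
print(swapped_order)
```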
That's a great observation! The CUDA code implements the algorithm as written in the paper; we do not parallelize the outer loop. Instead, we parallelize over the batch and nheads dimensions (each threadblock computes 1 head). This is sufficient if batch * nheads is large enough (around >= 80), so that there is enough parallel work across threadblocks to keep the GPU busy.
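Schematically, this parallelization looks like the sketch below: one worker (a threadblock in the CUDA kernel) per (batch, head) pair, each running the Algorithm 1 loops serially on its own slice. The names and shapes here are illustrative, not the actual flash-attn launch code.

```python
# Illustrative sketch of parallelizing over batch and heads.
batch, nheads = 16, 8

grid = [(b, h) for b in range(batch) for h in range(nheads)]
print(len(grid), "independent threadblocks")   # want roughly >= 80 of them

def attention_one_head(b, h):
    # Run the serial Algorithm 1 loops on the Q/K/V slice for (b, h).
    # No other threadblock reads or writes this head's O, so no race.
    pass

for b, h in grid:      # conceptually these all run in parallel on the GPU
    attention_one_head(b, h)
```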
One could also swap the order of the inner and outer loops (as done in Triton).