
parallel execution of read/write operations (paper vs implementation)

See original GitHub issue

In Algorithm 1 of FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, the inner loop starts by loading a tile of tensor O from HBM into SRAM as Oi. It then performs some operations and writes Oi from SRAM back to the HBM tensor O. For each iteration of the outer loop, the inner loop makes a full pass, so there is a full read and write of the O tensor.

If the outer loop (j) is executed serially, it should produce the correct result. But if we execute the outer loop in parallel, it seems we would have concurrent reads and writes to O. Therefore, if we strictly follow Algorithm 1 and "just" parallelize the outer loop, the output should not be correct.

In the Triton implementation, the inner and outer loops are swapped, so there is no need to load Oi: they start with a zeroed tile, do all the computation, and write the final result to HBM. They therefore have no concurrent-access issue on this variable.

Can you please clarify how the parallelization is implemented in CUDA? Is there something missing in the description above that would explain why Algorithm 1 works in parallel?
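To make the hazard described above concrete, here is a minimal NumPy sketch of the Algorithm 1 loop order (toy shapes; the tile sizes Br, Bc and variable names are illustrative, not taken from the CUDA code). The marked line is the read-modify-write of Oi, li, mi that every outer (j) iteration performs, which is why the j loop cannot naively run in parallel:

```python
import numpy as np

np.random.seed(0)
N, d, Br, Bc = 8, 4, 2, 2
Q, K, V = np.random.randn(N, d), np.random.randn(N, d), np.random.randn(N, d)

# Reference: standard (untiled) softmax attention.
S = Q @ K.T
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V

# Algorithm-1 loop order: outer over K/V tiles (j), inner over Q/O tiles (i).
O = np.zeros((N, d))
l = np.zeros(N)              # running softmax denominators
m = np.full(N, -np.inf)      # running row maxima
for j in range(0, N, Bc):            # outer loop: load K_j, V_j
    Kj, Vj = K[j:j+Bc], V[j:j+Bc]
    for i in range(0, N, Br):        # inner loop: load Q_i, O_i, l_i, m_i
        Sij = Q[i:i+Br] @ Kj.T
        mij = Sij.max(axis=1)
        Pij = np.exp(Sij - mij[:, None])
        lij = Pij.sum(axis=1)
        m_new = np.maximum(m[i:i+Br], mij)
        l_new = np.exp(m[i:i+Br] - m_new) * l[i:i+Br] + np.exp(mij - m_new) * lij
        # Read-modify-write of O_i: two j iterations running concurrently
        # would race on this slice (and on l_i, m_i).
        scale_old = (l[i:i+Br] * np.exp(m[i:i+Br] - m_new))[:, None]
        scale_new = np.exp(mij - m_new)[:, None]
        O[i:i+Br] = (scale_old * O[i:i+Br] + scale_new * (Pij @ Vj)) / l_new[:, None]
        l[i:i+Br], m[i:i+Br] = l_new, m_new

assert np.allclose(O, ref)  # serial execution is exact
```

Run serially, the tiled result matches the reference exactly; the correctness argument relies on each O_i update seeing the previous j iteration's values of O_i, l_i, m_i.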


Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 1
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

2 reactions
ZhongYingMatrix commented, Nov 9, 2022

One could also swap the order of the inner and outer loops (as done in Triton).

Hi tridao, sorry for bothering you in a closed issue again. Since Triton swaps the order of the loops, original:

for 1 ≤ j ≤ Tc do
    Load Kj, Vj from HBM to on-chip SRAM.
    for 1 ≤ i ≤ Tr do
        Load Qi, Oi, li, mi from HBM to on-chip SRAM.

swapped:

for 1 ≤ i ≤ Tr do
    for 1 ≤ j ≤ Tc do
        Load Kj, Vj from HBM to on-chip SRAM.
        Load Qi, Oi, li, mi from HBM to on-chip SRAM.

It seems that the loading of K/V increases by a factor of Tr in the Triton version?
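As a sanity check on that count, here is a toy Python counter for the K/V tile loads implied by the two pseudocode snippets above (the tile counts Tr, Tc are arbitrary example values, and this follows the quoted pseudocode literally, ignoring any caching a real implementation might do):

```python
def kv_loads(Tr, Tc, swapped):
    """Count K/V tile loads implied by the pseudocode snippets above."""
    loads = 0
    if not swapped:
        # Paper's order: K_j, V_j loaded once per outer (j) iteration.
        for j in range(Tc):
            loads += 1          # Load K_j, V_j once, reused for every i
            for i in range(Tr):
                pass            # inner loop uses the resident K_j, V_j
    else:
        # Swapped order with the load inside the inner loop, as quoted.
        for i in range(Tr):
            for j in range(Tc):
                loads += 1      # Load K_j, V_j again on every i iteration
    return loads

print(kv_loads(4, 3, swapped=False))  # 3  (= Tc)
print(kv_loads(4, 3, swapped=True))   # 12 (= Tr * Tc)
```

So taken literally, the swapped order issues Tr times as many K/V loads; in each outer (i) iteration, however, only one Q/O/l/m tile needs to stay resident for the whole inner loop.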

2 reactions
tridao commented, Aug 30, 2022

That’s a great observation! The CUDA code implements the algorithm as written in the paper; we do not parallelize the outer loop. Instead, we parallelize over the batch and nheads dimensions (each threadblock computes 1 head). This is sufficient if batch * nheads is large enough (around >= 80), so that there is enough parallel work across threadblocks.

One could also swap the order of the inner and outer loops (as done in Triton).
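The batch/heads parallelization can be sketched as follows (a NumPy stand-in with toy shapes, not the actual CUDA kernel): each (batch, head) pair owns disjoint slices of Q, K, V, and O, so each pair can be assigned to its own threadblock and the serial loops of Algorithm 1 run inside it without any cross-block sharing of O:

```python
import numpy as np

def softmax_attention(q, k, v):
    """Single-head attention; stands in for the per-threadblock kernel."""
    s = q @ k.T
    p = np.exp(s - s.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)) @ v

# Toy shapes (illustrative only).
batch, nheads, N, d = 2, 3, 8, 4
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((batch, nheads, N, d)) for _ in range(3))

O = np.empty((batch, nheads, N, d))
for b in range(batch):
    for h in range(nheads):
        # Each (b, h) pair reads and writes only its own slices, so these
        # iterations are independent -- this is the axis the CUDA launch
        # parallelizes, rather than the j loop of Algorithm 1.
        O[b, h] = softmax_attention(Q[b, h], K[b, h], V[b, h])
```

Because no two (b, h) iterations touch the same memory, running them concurrently is safe, which is why the serial inner/outer loop structure of Algorithm 1 can be kept intact inside each threadblock.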
