Illegal Memory Access when number of keys != queries
I'm experimenting with a scenario where I have more keys than queries (e.g., the common inference setting where the prior keys/values are cached). When I use different query and key lengths, I get:
```
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```
This only happens during the backward pass; the forward pass runs without crashing. It seems like FlashAttention is indexing something out of bounds? Here's the relevant part of the traceback:
File "/workspace/models/attention.py", line 176, in flash_attention
causal=True
File "/opt/conda/lib/python3.7/site-packages/flash_attn-0.1-py3.7-linux-x86_64.egg/flash_attn/flash_attn_interface.py", line 243, in flash_attn_unpadded_func
dropout_p, softmax_scale, causal, return_attn_probs)
(Triggered internally at /opt/conda/conda-bld/pytorch_1656352464346/work/torch/csrc/autograd/python_anomaly_mode.cpp:102.)
allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
Error executing job with overrides: ['name=test-tx-persist', '+experiment=tx']
Note: in my scenario, all batches have the same sequence length. I set `cu_seqlens_q` and `cu_seqlens_k` to the cumsum of `[0] + [q_seq_len] * b` and `[0] + [k_seq_len] * b`, respectively.
If I set both keys and queries to have the same sequence length, then everything works.
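
For concreteness, here is a minimal sketch of the setup being described, using the `flash_attn_unpadded_func` interface that appears in the traceback. The tensor sizes and dtype are illustrative assumptions; `causal=True` and the `cu_seqlens` construction are taken from the report. Per the report, the forward call runs but the backward pass triggers the illegal access:

```python
# Minimal sketch (not a verified repro): equal-length sequences per batch,
# with more keys than queries, packed for the varlen FlashAttention API.
import torch
from flash_attn.flash_attn_interface import flash_attn_unpadded_func

b, n_heads, head_dim = 4, 8, 64
q_seq_len, k_seq_len = 128, 512  # k_seq_len > q_seq_len (e.g., cached KV)

device = torch.device('cuda')
q = torch.randn(b * q_seq_len, n_heads, head_dim,
                device=device, dtype=torch.float16, requires_grad=True)
k = torch.randn(b * k_seq_len, n_heads, head_dim,
                device=device, dtype=torch.float16, requires_grad=True)
v = torch.randn_like(k).requires_grad_()

# cu_seqlens_* = cumsum of [0] + [seq_len] * b, as described above.
cu_seqlens_q = torch.arange(0, (b + 1) * q_seq_len, q_seq_len,
                            dtype=torch.int32, device=device)
cu_seqlens_k = torch.arange(0, (b + 1) * k_seq_len, k_seq_len,
                            dtype=torch.int32, device=device)

out = flash_attn_unpadded_func(q, k, v, cu_seqlens_q, cu_seqlens_k,
                               q_seq_len, k_seq_len,
                               dropout_p=0.0, causal=True)
out.sum().backward()  # forward is fine; the crash is reported here
```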

Yeah, that makes sense. I'll try to get to it this weekend (been busy preparing for ICML talks, etc.).
If you did it this way, you would simply remove the `k_len - q_len` keys that are omitted from attention altogether; there wouldn't be any point to passing them in.

The most standard use case for `k_len > q_len` with the autoregressive setting turned on (not cross-attention) is what Henry described.
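
To make that distinction concrete, here is a plain-PyTorch sketch of the two possible causal-mask alignments when `k_len > q_len`. This is an illustration of the semantics under discussion, not the kernel's actual code, and all names are made up:

```python
# Illustrative only: contrasts two ways a causal mask could be aligned
# when there are more keys than queries.
import torch

q_len, k_len = 3, 5
i = torch.arange(q_len).unsqueeze(1)  # query positions (rows)
j = torch.arange(k_len).unsqueeze(0)  # key positions (columns)

# Top-left aligned: query i attends to keys j <= i. The last
# k_len - q_len keys are never attended by any query, so there
# would be no point passing them in.
top_left = j <= i

# Bottom-right aligned: query i attends to keys j <= i + (k_len - q_len).
# This matches autoregressive decoding with a KV cache, where the q_len
# new queries come after k_len - q_len cached positions.
bottom_right = j <= i + (k_len - q_len)

print(top_left.int())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0]], dtype=torch.int32)
print(bottom_right.int())
# tensor([[1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]], dtype=torch.int32)
```

Under the top-left alignment the trailing keys are dead weight, which is the first point above; the bottom-right alignment is the one the KV-cache decoding case needs.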