Illegal Memory Access when number of keys != queries

I’m experimenting with a scenario where I have more keys than queries (e.g., a common scenario during inference where the prior keys/values are cached). When I pass queries and keys with different sequence lengths, I get:

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

This only happens during the backward pass. I can run the forward pass and it won’t crash. Seems like it’s related to FlashAttention indexing something that is out of bounds?
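
The error text suggests CUDA_LAUNCH_BLOCKING=1, which makes kernel launches synchronous so that the Python stack trace points at the call that actually failed. A minimal sketch of setting it from inside a script, assuming this runs before anything initializes CUDA (exporting the variable on the command line works as well):

```python
import os

# Must be set before the first CUDA call; setting it before importing
# the framework is the safest place.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import the framework only after the variable is set
```

With synchronous launches the run is slower, so this is best kept to debugging sessions.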

  File "/workspace/models/attention.py", line 176, in flash_attention
    causal=True
  File "/opt/conda/lib/python3.7/site-packages/flash_attn-0.1-py3.7-linux-x86_64.egg/flash_attn/flash_attn_interface.py", line 243, in flash_attn_unpadded_func
    dropout_p, softmax_scale, causal, return_attn_probs)
 (Triggered internally at  /opt/conda/conda-bld/pytorch_1656352464346/work/torch/csrc/autograd/python_anomaly_mode.cpp:102.)
  allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
Error executing job with overrides: ['name=test-tx-persist', '+experiment=tx']

Note: In my scenario, every sequence in the batch has the same length. I set cu_seqlens_q and cu_seqlens_k as the cumsum of [0]+[q_seq_len] * b and the cumsum of [0]+[k_seq_len] * b respectively.
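
The cumulative-length construction described in the note can be sketched in plain Python as follows. The helper name and the example sizes (b = 4, 2 queries, 10 keys) are assumptions for illustration; in real use the lists would presumably be converted to integer tensors on the GPU before being passed to flash_attn_unpadded_func.

```python
from itertools import accumulate
from typing import List


def build_cu_seqlens(seq_len: int, batch_size: int) -> List[int]:
    """Cumulative sequence boundaries for a batch of equal-length sequences.

    Entry n is the offset where sequence n starts in the packed (unpadded)
    layout; the last entry is the total number of tokens.
    """
    return list(accumulate([0] + [seq_len] * batch_size))


# Hypothetical sizes, chosen only to illustrate q_seq_len != k_seq_len.
b, q_seq_len, k_seq_len = 4, 2, 10

cu_seqlens_q = build_cu_seqlens(q_seq_len, b)
cu_seqlens_k = build_cu_seqlens(k_seq_len, b)

print(cu_seqlens_q)  # [0, 2, 4, 6, 8]
print(cu_seqlens_k)  # [0, 10, 20, 30, 40]
```

Both lists have b + 1 entries, but their final values differ, which is the situation that triggers the crash reported above.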

If I set both keys and queries to have the same sequence length, then everything works.

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 23 (16 by maintainers)

Top GitHub Comments

3 reactions
tridao commented, Jul 15, 2022

Ya that makes sense. I’ll try to get to it this weekend (been busy with preparing for ICML talks etc.)

2 reactions
lucidrains commented, Jul 16, 2022

Hi, first of all I would like to congratulate the authors for this neat work. I am having a lot of fun going through the code and understanding it.

Right now causal means query i will attend to key 1, 2, ..., i. I’m thinking mainly of training auto-regressive LM, where seqlen_q == seqlen_k. I’m not too sure what causal should mean when seqlen_q != seqlen_k. Right now, if there are 2 queries and 10 keys, then query 1 will attend to key 1 and query 2 will attend to key 1 & 2. I’m open to suggestions here (what causal should mean if seqlen_q != seqlen_k). It’s all just simple index calculation in CUDA so it’ll be pretty easy to change.
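
The behaviour described in the paragraph above (query i attends to keys 1 through i, regardless of how many keys there are) can be illustrated with a small mask builder. This is a plain-Python sketch with 0-based indices and a hypothetical function name, not the CUDA index calculation itself.

```python
from typing import List


def causal_mask_top_left(seqlen_q: int, seqlen_k: int) -> List[List[bool]]:
    """mask[i][j] is True when query i may attend to key j (0-based).

    The causal diagonal starts at the top-left corner, so any keys beyond
    index seqlen_q - 1 are never attended to.
    """
    return [[j <= i for j in range(seqlen_k)] for i in range(seqlen_q)]


# The 2-queries / 10-keys case from the comment above.
mask = causal_mask_top_left(2, 10)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# x.........
# xx........
```

The last eight key columns are empty for every query, which is the concern raised in the replies.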

When is_causal=True and seqlen_q < seqlen_k (I’m not sure what to do in the opposite case, seqlen_q > seqlen_k), I personally would have expected the opposite behaviour: ‘causal’ should mean that query ‘i’ can attend to keys ‘seqlen_k - i, …, seqlen_k - 1, seqlen_k’. This would simulate autoregressive decoding with a prefix and would be more useful than ‘1, 2, …, i’, which essentially means that seqlen_k - seqlen_q keys are ignored, if I understand correctly; apologies if I have misunderstood.

I think this will automatically cover some of the cases above? Maybe a templated functor could be provided to the kernels to index the leftmost, rightmost, or other patterns, which could help other efficient use cases?

Thank you very much,

if you did it this way, you would simply remove the k_len - q_len keys that are omitted from attention altogether. there wouldn’t be any point to passing it in

the most standard use-case for k_len > q_len when autoregressive setting is turned on (not cross attention), is what Henry described
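
One way to read the prefix-style masking discussed in this thread is to align the causal diagonal with the bottom-right corner of the attention matrix, so that the last query sees every key. The sketch below is that interpretation in plain Python, not FlashAttention's kernel code; the function name and the 0-based indexing are assumptions made for illustration.

```python
from typing import List


def causal_mask_bottom_right(seqlen_q: int, seqlen_k: int) -> List[List[bool]]:
    """mask[i][j] is True when query i may attend to key j (0-based).

    Assumes seqlen_q <= seqlen_k; the thread leaves the opposite case open.
    """
    offset = seqlen_k - seqlen_q  # number of cached "prefix" keys
    return [[j <= i + offset for j in range(seqlen_k)] for i in range(seqlen_q)]


# Same 2-queries / 10-keys case: the first query sees the 8 cached keys
# plus its own position, the second query sees all 10 keys.
mask = causal_mask_bottom_right(2, 10)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xxxxxxxxx.
# xxxxxxxxxx
```

When seqlen_q == seqlen_k the offset is zero and this reduces to the usual lower-triangular mask.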

