Illegal Memory Access when number of keys != queries
I'm experimenting with a scenario where I have more keys than queries (e.g., the common inference setting where the prior keys/values are cached). When I use different query and key lengths, I get:
```
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```
This only happens during the backward pass; the forward pass runs without crashing. It seems like FlashAttention is indexing something out of bounds? Here's the relevant part of the traceback:
File "/workspace/models/attention.py", line 176, in flash_attention
causal=True
File "/opt/conda/lib/python3.7/site-packages/flash_attn-0.1-py3.7-linux-x86_64.egg/flash_attn/flash_attn_interface.py", line 243, in flash_attn_unpadded_func
dropout_p, softmax_scale, causal, return_attn_probs)
(Triggered internally at /opt/conda/conda-bld/pytorch_1656352464346/work/torch/csrc/autograd/python_anomaly_mode.cpp:102.)
allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
Error executing job with overrides: ['name=test-tx-persist', '+experiment=tx']
Note: in my scenario, all batches have the same sequence length. I set `cu_seqlens_q` and `cu_seqlens_k` to the cumsum of `[0] + [q_seq_len] * b` and `[0] + [k_seq_len] * b`, respectively.
If I set both keys and queries to have the same sequence length, then everything works.
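
For concreteness, here is a minimal sketch of the setup being described, using the `flash_attn_unpadded_func` interface that appears in the traceback. The tensor sizes and dtype are illustrative assumptions; `causal=True` and the `cu_seqlens` construction are taken from the report. Per the report, the forward call runs but the backward pass triggers the illegal access:

```python
# Minimal sketch (not a verified repro): equal-length sequences per batch,
# with more keys than queries, packed for the varlen FlashAttention API.
import torch
from flash_attn.flash_attn_interface import flash_attn_unpadded_func

b, n_heads, head_dim = 4, 8, 64
q_seq_len, k_seq_len = 128, 512  # k_seq_len > q_seq_len (e.g., cached KV)

device = torch.device('cuda')
q = torch.randn(b * q_seq_len, n_heads, head_dim,
                device=device, dtype=torch.float16, requires_grad=True)
k = torch.randn(b * k_seq_len, n_heads, head_dim,
                device=device, dtype=torch.float16, requires_grad=True)
v = torch.randn_like(k).requires_grad_()

# cu_seqlens_* = cumsum of [0] + [seq_len] * b, as described above.
cu_seqlens_q = torch.arange(0, (b + 1) * q_seq_len, q_seq_len,
                            dtype=torch.int32, device=device)
cu_seqlens_k = torch.arange(0, (b + 1) * k_seq_len, k_seq_len,
                            dtype=torch.int32, device=device)

out = flash_attn_unpadded_func(q, k, v, cu_seqlens_q, cu_seqlens_k,
                               q_seq_len, k_seq_len,
                               dropout_p=0.0, causal=True)
out.sum().backward()  # forward is fine; the crash is reported here
```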

Yeah, that makes sense. I'll try to get to it this weekend (been busy preparing for ICML talks, etc.).
If you did it this way, you would simply remove the `k_len - q_len` keys that are omitted from attention altogether; there wouldn't be any point to passing them in.

The most standard use case for `k_len > q_len` with the autoregressive setting turned on (not cross-attention) is what Henry described.
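
To make that distinction concrete, here is a plain-PyTorch sketch of the two possible causal-mask alignments when `k_len > q_len`. This is an illustration of the semantics under discussion, not the kernel's actual code, and all names are made up:

```python
# Illustrative only: contrasts two ways a causal mask could be aligned
# when there are more keys than queries.
import torch

q_len, k_len = 3, 5
i = torch.arange(q_len).unsqueeze(1)  # query positions (rows)
j = torch.arange(k_len).unsqueeze(0)  # key positions (columns)

# Top-left aligned: query i attends to keys j <= i. The last
# k_len - q_len keys are never attended by any query, so there
# would be no point passing them in.
top_left = j <= i

# Bottom-right aligned: query i attends to keys j <= i + (k_len - q_len).
# This matches autoregressive decoding with a KV cache, where the q_len
# new queries come after k_len - q_len cached positions.
bottom_right = j <= i + (k_len - q_len)

print(top_left.int())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0]], dtype=torch.int32)
print(bottom_right.int())
# tensor([[1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]], dtype=torch.int32)
```

Under the top-left alignment the trailing keys are dead weight, which is the first point above; the bottom-right alignment is the one the KV-cache decoding case needs.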