Failing tests using RTX 3090
Hi,
Thank you for developing these kernels. I tried them on a time-series problem and they seem to work as expected.
If I run the test suite, I get three failing tests:
FAILED tests/test_flash_attn.py::test_flash_attn_race_condition[0.0-1025-32-False-dtype1] - AssertionError: assert False
FAILED tests/test_flash_attn.py::test_flash_attn_race_condition[0.0-1025-16-False-dtype0] - AssertionError: assert False
FAILED tests/test_flash_attn.py::test_flash_attn_race_condition[0.17-1025-32-False-dtype0] - AssertionError: assert False
Do you know whether these failures are important? And in general, would you advise using the Triton kernels or the CUDA kernels?
FlashAttention build:
CC=gcc-11 CXX=g++-11 python setup.py develop
commit 7c9953815aa04bb61e24237ffc29780708cc9c8e (HEAD -> main, origin/main, origin/HEAD)
Author: Tri Dao <tridpq@gmail.com>
Date: Sat Nov 12 19:49:33 2022 -0800
Add fused cross entropy loss
System
Python 3.10.6
torch==1.14.0.dev20221015+cu117
Driver Version: 515.65.01 CUDA Version: 11.7
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0
ptxas: NVIDIA (R) Ptx optimizing assembler
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:03_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0
Triton 2.0.0: commit 0d7e7532279e45672555e344646f5c19c3972331
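To reproduce just these cases, the three failing parametrizations can be rerun in isolation by their node IDs. A minimal sketch, invoking pytest from Python with the node IDs listed above (this also avoids shell-quoting the bracketed IDs; the plain pytest CLI with the same IDs works equally well):

import pytest

# Node IDs copied from the failure summary; -q keeps the output compact.
failing = [
    "tests/test_flash_attn.py::test_flash_attn_race_condition[0.0-1025-32-False-dtype1]",
    "tests/test_flash_attn.py::test_flash_attn_race_condition[0.0-1025-16-False-dtype0]",
    "tests/test_flash_attn.py::test_flash_attn_race_condition[0.17-1025-32-False-dtype0]",
]
raise SystemExit(pytest.main(["-q", *failing]))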
TEST FAILURES
===================================================================================================================== FAILURES ======================================================================================================================
_____________________________________________________________________________________________ test_flash_attn_race_condition[0.0-1025-32-False-dtype1] ______________________________________________________________________________________________
seqlen = 1025, d = 32, dropout_p = 0.0, causal = False, dtype = torch.bfloat16
    @pytest.mark.parametrize('dtype', ([torch.float16] if is_sm75 else [torch.float16, torch.bfloat16]))
    # @pytest.mark.parametrize('dtype', [torch.float16])
    @pytest.mark.parametrize('causal', [False, True])
    @pytest.mark.parametrize('d', [128, 64, 80, 40, 32, 16])
    # @pytest.mark.parametrize('d', [64])
    @pytest.mark.parametrize('seqlen', [97, 128, 200, 256, 257, 384, 512, 768, 1024, 1025, 2048])
    # @pytest.mark.parametrize('seqlen', [128])
    @pytest.mark.parametrize('dropout_p', [0.0, 0.17])
    # @pytest.mark.parametrize('dropout_p', [0.0])
    def test_flash_attn_race_condition(seqlen, d, dropout_p, causal, dtype):
        if seqlen >= 2048 and torch.cuda.get_device_properties('cuda').total_memory <= 16 * 2**30:
            pytest.skip() # Reference implementation OOM
        device = 'cuda'
        # set seed
        torch.random.manual_seed(0)
        batch_size = 32
        nheads = 4
        x = torch.randn(batch_size, seqlen, nheads * d, device=device, dtype=dtype, requires_grad=True)
        Wqkv = torch.nn.Linear(nheads * d, 3 * nheads * d, device=device, dtype=dtype)
        query_padding_mask = generate_random_padding_mask(seqlen, batch_size, device, mode='random')
        key_padding_mask = generate_random_padding_mask(seqlen, batch_size, device, mode='random')
        (q_unpad, k_unpad, v_unpad, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k, q, k, v,
         output_pad_fn, dq_pad_fn, dk_pad_fn) = generate_qkv(
            x, Wqkv, nheads, query_padding_mask, key_padding_mask
        )
        torch.random.manual_seed(0)
        output_unpad_0, sm_lse_0, S_dmask_0 = flash_attn_unpadded_func(
            q_unpad, k_unpad, v_unpad, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k,
            dropout_p, return_attn_probs=True, causal=causal
        )
        S_dmask_converted_0 = convert_flash_attn_S_to_softmax(
            S_dmask_0, query_padding_mask, key_padding_mask, d, dropout_p > 0.0, causal=causal
        )
        if is_sm80 or d <= 64: # Only run backward for d=128 on A100
            g = torch.randn_like(output_unpad_0)
            dq_unpad_0, dk_unpad_0, dv_unpad_0, = torch.autograd.grad(output_unpad_0,
                                                                      (q_unpad, k_unpad, v_unpad), g)
        for _ in range(10):
            torch.random.manual_seed(0)
            output_unpad, sm_lse, S_dmask = flash_attn_unpadded_func(
                q_unpad, k_unpad, v_unpad, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k,
                dropout_p, return_attn_probs=True, causal=causal
            )
            S_dmask_converted = convert_flash_attn_S_to_softmax(
                S_dmask, query_padding_mask, key_padding_mask, d, dropout_p > 0.0, causal=causal
            )
            assert torch.equal(output_unpad, output_unpad_0)
            # sm_lse has some parts that are uninitialized from torch.empty
            # assert torch.equal(sm_lse, sm_lse_0)
            assert torch.equal(S_dmask_converted, S_dmask_converted_0)
            if is_sm80 or d <= 64: # Only run backward for d=128 on A100
                dq_unpad, dk_unpad, dv_unpad, = torch.autograd.grad(output_unpad,
                                                                    (q_unpad, k_unpad, v_unpad), g)
>               assert torch.equal(dq_unpad, dq_unpad_0)
E AssertionError: assert False
E + where False = <built-in method equal of type object at 0x7fd2cde0e3a0>(tensor([[[ 1.8921e-03, -1.1414e-02, -6.4087e-03, ..., 7.4158e-03,\n 4.5166e-02, -1.7090e-02],\n [ 2....9062e-02, 1.5488e-03, ..., -5.5847e-03,\n -4.2725e-02, -2.6611e-02]]], device='cuda:0', dtype=torch.bfloat16), tensor([[[ 1.8921e-03, -1.1414e-02, -6.4087e-03, ..., 7.4158e-03,\n 4.5166e-02, -1.7090e-02],\n [ 2....9062e-02, 1.5488e-03, ..., -5.5847e-03,\n -4.2725e-02, -2.6611e-02]]], device='cuda:0', dtype=torch.bfloat16))
E + where <built-in method equal of type object at 0x7fd2cde0e3a0> = torch.equal
tests/test_flash_attn.py:785: AssertionError
_____________________________________________________________________________________________ test_flash_attn_race_condition[0.0-1025-16-False-dtype0] ______________________________________________________________________________________________
seqlen = 1025, d = 16, dropout_p = 0.0, causal = False, dtype = torch.float16
    @pytest.mark.parametrize('dtype', ([torch.float16] if is_sm75 else [torch.float16, torch.bfloat16]))
    # @pytest.mark.parametrize('dtype', [torch.float16])
    @pytest.mark.parametrize('causal', [False, True])
    @pytest.mark.parametrize('d', [128, 64, 80, 40, 32, 16])
    # @pytest.mark.parametrize('d', [64])
    @pytest.mark.parametrize('seqlen', [97, 128, 200, 256, 257, 384, 512, 768, 1024, 1025, 2048])
    # @pytest.mark.parametrize('seqlen', [128])
    @pytest.mark.parametrize('dropout_p', [0.0, 0.17])
    # @pytest.mark.parametrize('dropout_p', [0.0])
    def test_flash_attn_race_condition(seqlen, d, dropout_p, causal, dtype):
        if seqlen >= 2048 and torch.cuda.get_device_properties('cuda').total_memory <= 16 * 2**30:
            pytest.skip() # Reference implementation OOM
        device = 'cuda'
        # set seed
        torch.random.manual_seed(0)
        batch_size = 32
        nheads = 4
        x = torch.randn(batch_size, seqlen, nheads * d, device=device, dtype=dtype, requires_grad=True)
        Wqkv = torch.nn.Linear(nheads * d, 3 * nheads * d, device=device, dtype=dtype)
        query_padding_mask = generate_random_padding_mask(seqlen, batch_size, device, mode='random')
        key_padding_mask = generate_random_padding_mask(seqlen, batch_size, device, mode='random')
        (q_unpad, k_unpad, v_unpad, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k, q, k, v,
         output_pad_fn, dq_pad_fn, dk_pad_fn) = generate_qkv(
            x, Wqkv, nheads, query_padding_mask, key_padding_mask
        )
        torch.random.manual_seed(0)
        output_unpad_0, sm_lse_0, S_dmask_0 = flash_attn_unpadded_func(
            q_unpad, k_unpad, v_unpad, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k,
            dropout_p, return_attn_probs=True, causal=causal
        )
        S_dmask_converted_0 = convert_flash_attn_S_to_softmax(
            S_dmask_0, query_padding_mask, key_padding_mask, d, dropout_p > 0.0, causal=causal
        )
        if is_sm80 or d <= 64: # Only run backward for d=128 on A100
            g = torch.randn_like(output_unpad_0)
            dq_unpad_0, dk_unpad_0, dv_unpad_0, = torch.autograd.grad(output_unpad_0,
                                                                      (q_unpad, k_unpad, v_unpad), g)
        for _ in range(10):
            torch.random.manual_seed(0)
            output_unpad, sm_lse, S_dmask = flash_attn_unpadded_func(
                q_unpad, k_unpad, v_unpad, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k,
                dropout_p, return_attn_probs=True, causal=causal
            )
            S_dmask_converted = convert_flash_attn_S_to_softmax(
                S_dmask, query_padding_mask, key_padding_mask, d, dropout_p > 0.0, causal=causal
            )
            assert torch.equal(output_unpad, output_unpad_0)
            # sm_lse has some parts that are uninitialized from torch.empty
            # assert torch.equal(sm_lse, sm_lse_0)
            assert torch.equal(S_dmask_converted, S_dmask_converted_0)
            if is_sm80 or d <= 64: # Only run backward for d=128 on A100
                dq_unpad, dk_unpad, dv_unpad, = torch.autograd.grad(output_unpad,
                                                                    (q_unpad, k_unpad, v_unpad), g)
>               assert torch.equal(dq_unpad, dq_unpad_0)
E AssertionError: assert False
E + where False = <built-in method equal of type object at 0x7fd2cde0e3a0>(tensor([[[ 0.0785, 0.0568, -0.0504, ..., -0.0021, -0.0053, 0.0227],\n [ 0.0676, -0.0684, 0.0598, ..., 0.0...,\n [-0.0923, -0.0714, 0.0667, ..., 0.0260, -0.0945, 0.0140]]],\n device='cuda:0', dtype=torch.float16), tensor([[[ 0.0785, 0.0568, -0.0504, ..., -0.0021, -0.0053, 0.0227],\n [ 0.0676, -0.0684, 0.0598, ..., 0.0...,\n [-0.0923, -0.0714, 0.0667, ..., 0.0260, -0.0945, 0.0140]]],\n device='cuda:0', dtype=torch.float16))
E + where <built-in method equal of type object at 0x7fd2cde0e3a0> = torch.equal
tests/test_flash_attn.py:785: AssertionError
_____________________________________________________________________________________________ test_flash_attn_race_condition[0.17-1025-32-False-dtype0] _____________________________________________________________________________________________
seqlen = 1025, d = 32, dropout_p = 0.17, causal = False, dtype = torch.float16
    @pytest.mark.parametrize('dtype', ([torch.float16] if is_sm75 else [torch.float16, torch.bfloat16]))
    # @pytest.mark.parametrize('dtype', [torch.float16])
    @pytest.mark.parametrize('causal', [False, True])
    @pytest.mark.parametrize('d', [128, 64, 80, 40, 32, 16])
    # @pytest.mark.parametrize('d', [64])
    @pytest.mark.parametrize('seqlen', [97, 128, 200, 256, 257, 384, 512, 768, 1024, 1025, 2048])
    # @pytest.mark.parametrize('seqlen', [128])
    @pytest.mark.parametrize('dropout_p', [0.0, 0.17])
    # @pytest.mark.parametrize('dropout_p', [0.0])
    def test_flash_attn_race_condition(seqlen, d, dropout_p, causal, dtype):
        if seqlen >= 2048 and torch.cuda.get_device_properties('cuda').total_memory <= 16 * 2**30:
            pytest.skip() # Reference implementation OOM
        device = 'cuda'
        # set seed
        torch.random.manual_seed(0)
        batch_size = 32
        nheads = 4
        x = torch.randn(batch_size, seqlen, nheads * d, device=device, dtype=dtype, requires_grad=True)
        Wqkv = torch.nn.Linear(nheads * d, 3 * nheads * d, device=device, dtype=dtype)
        query_padding_mask = generate_random_padding_mask(seqlen, batch_size, device, mode='random')
        key_padding_mask = generate_random_padding_mask(seqlen, batch_size, device, mode='random')
        (q_unpad, k_unpad, v_unpad, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k, q, k, v,
         output_pad_fn, dq_pad_fn, dk_pad_fn) = generate_qkv(
            x, Wqkv, nheads, query_padding_mask, key_padding_mask
        )
        torch.random.manual_seed(0)
        output_unpad_0, sm_lse_0, S_dmask_0 = flash_attn_unpadded_func(
            q_unpad, k_unpad, v_unpad, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k,
            dropout_p, return_attn_probs=True, causal=causal
        )
        S_dmask_converted_0 = convert_flash_attn_S_to_softmax(
            S_dmask_0, query_padding_mask, key_padding_mask, d, dropout_p > 0.0, causal=causal
        )
        if is_sm80 or d <= 64: # Only run backward for d=128 on A100
            g = torch.randn_like(output_unpad_0)
            dq_unpad_0, dk_unpad_0, dv_unpad_0, = torch.autograd.grad(output_unpad_0,
                                                                      (q_unpad, k_unpad, v_unpad), g)
        for _ in range(10):
            torch.random.manual_seed(0)
            output_unpad, sm_lse, S_dmask = flash_attn_unpadded_func(
                q_unpad, k_unpad, v_unpad, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k,
                dropout_p, return_attn_probs=True, causal=causal
            )
            S_dmask_converted = convert_flash_attn_S_to_softmax(
                S_dmask, query_padding_mask, key_padding_mask, d, dropout_p > 0.0, causal=causal
            )
            assert torch.equal(output_unpad, output_unpad_0)
            # sm_lse has some parts that are uninitialized from torch.empty
            # assert torch.equal(sm_lse, sm_lse_0)
            assert torch.equal(S_dmask_converted, S_dmask_converted_0)
            if is_sm80 or d <= 64: # Only run backward for d=128 on A100
                dq_unpad, dk_unpad, dv_unpad, = torch.autograd.grad(output_unpad,
                                                                    (q_unpad, k_unpad, v_unpad), g)
>               assert torch.equal(dq_unpad, dq_unpad_0)
E AssertionError: assert False
E + where False = <built-in method equal of type object at 0x7fd2cde0e3a0>(tensor([[[-0.0155, -0.0330, -0.0320, ..., -0.0228, 0.0065, 0.0318],\n [-0.0445, -0.0417, -0.0245, ..., 0.0...,\n [ 0.0178, 0.0198, 0.0330, ..., -0.0251, -0.0323, 0.0094]]],\n device='cuda:0', dtype=torch.float16), tensor([[[-0.0155, -0.0330, -0.0320, ..., -0.0228, 0.0065, 0.0318],\n [-0.0445, -0.0417, -0.0245, ..., 0.0...,\n [ 0.0178, 0.0198, 0.0330, ..., -0.0251, -0.0323, 0.0094]]],\n device='cuda:0', dtype=torch.float16))
E + where <built-in method equal of type object at 0x7fd2cde0e3a0> = torch.equal
tests/test_flash_attn.py:785: AssertionError
============================================================================================================== short test summary info ==============================================================================================================
FAILED tests/test_flash_attn.py::test_flash_attn_race_condition[0.0-1025-32-False-dtype1] - AssertionError: assert False
FAILED tests/test_flash_attn.py::test_flash_attn_race_condition[0.0-1025-16-False-dtype0] - AssertionError: assert False
FAILED tests/test_flash_attn.py::test_flash_attn_race_condition[0.17-1025-32-False-dtype0] - AssertionError: assert False
3 failed, 2157 passed, 2801 skipped in 94.92s (0:01:34)
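For what it's worth, the same determinism check can be exercised outside the test suite: run an identical forward and backward pass twice with the same seed and compare the results bitwise. Below is a minimal sketch for one of the failing configurations (dropout 0.0, seqlen 1025, d=16, fp16, non-causal), assuming flash_attn_unpadded_func is importable from flash_attn.flash_attn_interface as in this version of the repo, and using a single fixed sequence length in place of the test's random padding masks:

import torch
from flash_attn.flash_attn_interface import flash_attn_unpadded_func


def run_once(q, k, v, cu_seqlens, max_seqlen, dropout_p):
    # Re-seed so the dropout mask (if any) is identical across runs.
    torch.random.manual_seed(0)
    out = flash_attn_unpadded_func(
        q, k, v, cu_seqlens, cu_seqlens, max_seqlen, max_seqlen,
        dropout_p, causal=False,
    )
    dq, dk, dv = torch.autograd.grad(out, (q, k, v), torch.ones_like(out))
    return out, dq, dk, dv


device, dtype = 'cuda', torch.float16
batch_size, seqlen, nheads, d = 32, 1025, 4, 16  # one of the failing configs
q, k, v = [torch.randn(batch_size * seqlen, nheads, d, device=device,
                       dtype=dtype, requires_grad=True) for _ in range(3)]
# Cumulative sequence lengths for the unpadded (variable-length) interface.
cu_seqlens = torch.arange(0, (batch_size + 1) * seqlen, seqlen,
                          device=device, dtype=torch.int32)

ref = run_once(q, k, v, cu_seqlens, seqlen, dropout_p=0.0)
for _ in range(10):
    cur = run_once(q, k, v, cu_seqlens, seqlen, dropout_p=0.0)
    # Bitwise comparison, mirroring the torch.equal checks in the test.
    assert all(torch.equal(a, b) for a, b in zip(ref, cur)), 'non-deterministic result'

A failure here would show the same non-bitwise-reproducible backward results that the pytest log above reports.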
I’m planning to release an optimized implementation of GPT (probably on this repo) this coming week. It’ll have FlashAttention and a bunch of other optimizations.
Thanks.
Looking forward to the transformer release!