
Failing tests using 3090 RTX


Hi,

Thank you for developing these kernels. I tried them on a time-series problem and they seem to work as expected.

If I run the test suite, I get 3 failing tests (a minimal way to re-run just these cases is sketched after the list):

FAILED tests/test_flash_attn.py::test_flash_attn_race_condition[0.0-1025-32-False-dtype1] - AssertionError: assert False
FAILED tests/test_flash_attn.py::test_flash_attn_race_condition[0.0-1025-16-False-dtype0] - AssertionError: assert False
FAILED tests/test_flash_attn.py::test_flash_attn_race_condition[0.17-1025-32-False-dtype0] - AssertionError: assert False
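
For reference, a minimal way to re-run only these cases from Python (the -k filter expression is my own choice; the equivalent shell invocation of pytest behaves the same way):

    import pytest

    # Select only the race-condition tests at seqlen=1025, run from the
    # flash-attention repo root; -rA prints a result summary for every selected test.
    pytest.main([
        "tests/test_flash_attn.py",
        "-k", "test_flash_attn_race_condition and 1025",
        "-rA",
    ])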

Do you know if these failures are important? In general, is your advice to use the Triton or the CUDA kernels?
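
For context, the failing tests exercise the CUDA path through flash_attn_unpadded_func. Below is a minimal sketch of calling it on fixed-length (un-padded) sequences, assuming the flash_attn.flash_attn_interface import path and mirroring the positional signature the tests use:

    import torch
    from flash_attn.flash_attn_interface import flash_attn_unpadded_func  # import path assumed

    batch_size, seqlen, nheads, d = 2, 1024, 4, 64
    device, dtype = "cuda", torch.float16

    # All sequences are packed into one (total_tokens, nheads, d) tensor.
    q = torch.randn(batch_size * seqlen, nheads, d, device=device, dtype=dtype, requires_grad=True)
    k = torch.randn_like(q, requires_grad=True)
    v = torch.randn_like(q, requires_grad=True)

    # cu_seqlens hold the cumulative start offsets of each sequence in the packed layout.
    cu_seqlens = torch.arange(0, (batch_size + 1) * seqlen, seqlen, dtype=torch.int32, device=device)

    out, softmax_lse, S_dmask = flash_attn_unpadded_func(
        q, k, v, cu_seqlens, cu_seqlens, seqlen, seqlen,
        0.0,  # dropout_p
        return_attn_probs=True, causal=False,
    )

requires_grad=True is only needed if you also want to check backward-pass determinism the way the race-condition test does, i.e. by calling torch.autograd.grad on the output and comparing gradients across repeated runs.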

FlashAttention build:

CC=gcc-11 CXX=g++-11 python setup.py develop

commit 7c9953815aa04bb61e24237ffc29780708cc9c8e (HEAD -> main, origin/main, origin/HEAD)
Author: Tri Dao <tridpq@gmail.com>
Date:   Sat Nov 12 19:49:33 2022 -0800

    Add fused cross entropy loss

System

Python 3.10.6
torch==1.14.0.dev20221015+cu117
Driver Version: 515.65.01    CUDA Version: 11.7

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0

ptxas: NVIDIA (R) Ptx optimizing assembler
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:03_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0

Triton 2.0.0, commit 0d7e7532279e45672555e344646f5c19c3972331

TEST FAILURES

===================================================================================================================== FAILURES ======================================================================================================================
_____________________________________________________________________________________________ test_flash_attn_race_condition[0.0-1025-32-False-dtype1] ______________________________________________________________________________________________

seqlen = 1025, d = 32, dropout_p = 0.0, causal = False, dtype = torch.bfloat16

    @pytest.mark.parametrize('dtype', ([torch.float16] if is_sm75 else [torch.float16, torch.bfloat16]))
    # @pytest.mark.parametrize('dtype', [torch.float16])
    @pytest.mark.parametrize('causal', [False, True])
    @pytest.mark.parametrize('d', [128, 64, 80, 40, 32, 16])
    # @pytest.mark.parametrize('d', [64])
    @pytest.mark.parametrize('seqlen', [97, 128, 200, 256, 257, 384, 512, 768, 1024, 1025, 2048])
    # @pytest.mark.parametrize('seqlen', [128])
    @pytest.mark.parametrize('dropout_p', [0.0, 0.17])
    # @pytest.mark.parametrize('dropout_p', [0.0])
    def test_flash_attn_race_condition(seqlen, d, dropout_p, causal, dtype):
        if seqlen >= 2048 and torch.cuda.get_device_properties('cuda').total_memory <= 16 * 2**30:
            pytest.skip()  # Reference implementation OOM
        device = 'cuda'
        # set seed
        torch.random.manual_seed(0)
        batch_size = 32
        nheads = 4
        x = torch.randn(batch_size, seqlen, nheads * d, device=device, dtype=dtype, requires_grad=True)
        Wqkv = torch.nn.Linear(nheads * d, 3 * nheads * d, device=device, dtype=dtype)
    
        query_padding_mask = generate_random_padding_mask(seqlen, batch_size, device, mode='random')
        key_padding_mask = generate_random_padding_mask(seqlen, batch_size, device, mode='random')
    
        (q_unpad, k_unpad, v_unpad, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k, q, k, v,
         output_pad_fn, dq_pad_fn, dk_pad_fn) = generate_qkv(
             x, Wqkv, nheads, query_padding_mask, key_padding_mask
         )
    
        torch.random.manual_seed(0)
        output_unpad_0, sm_lse_0, S_dmask_0 = flash_attn_unpadded_func(
            q_unpad, k_unpad, v_unpad, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k,
            dropout_p, return_attn_probs=True, causal=causal
        )
        S_dmask_converted_0 = convert_flash_attn_S_to_softmax(
            S_dmask_0, query_padding_mask, key_padding_mask, d, dropout_p > 0.0, causal=causal
        )
    
        if is_sm80 or d <= 64:  # Only run backward for d=128 on A100
            g = torch.randn_like(output_unpad_0)
            dq_unpad_0, dk_unpad_0, dv_unpad_0, = torch.autograd.grad(output_unpad_0,
                                                                      (q_unpad, k_unpad, v_unpad), g)
    
        for _ in range(10):
            torch.random.manual_seed(0)
            output_unpad, sm_lse, S_dmask = flash_attn_unpadded_func(
                q_unpad, k_unpad, v_unpad, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k,
                dropout_p, return_attn_probs=True, causal=causal
            )
            S_dmask_converted = convert_flash_attn_S_to_softmax(
                S_dmask, query_padding_mask, key_padding_mask, d, dropout_p > 0.0, causal=causal
            )
            assert torch.equal(output_unpad, output_unpad_0)
            # sm_lse has some parts that are uninitialized from torch.empty
            # assert torch.equal(sm_lse, sm_lse_0)
            assert torch.equal(S_dmask_converted, S_dmask_converted_0)
    
            if is_sm80 or d <= 64:  # Only run backward for d=128 on A100
                dq_unpad, dk_unpad, dv_unpad, = torch.autograd.grad(output_unpad,
                                                                    (q_unpad, k_unpad, v_unpad), g)
>               assert torch.equal(dq_unpad, dq_unpad_0)
E               AssertionError: assert False
E                +  where False = <built-in method equal of type object at 0x7fd2cde0e3a0>(tensor([[[ 1.8921e-03, -1.1414e-02, -6.4087e-03,  ...,  7.4158e-03,\n           4.5166e-02, -1.7090e-02],\n         [ 2....9062e-02,  1.5488e-03,  ..., -5.5847e-03,\n          -4.2725e-02, -2.6611e-02]]], device='cuda:0', dtype=torch.bfloat16), tensor([[[ 1.8921e-03, -1.1414e-02, -6.4087e-03,  ...,  7.4158e-03,\n           4.5166e-02, -1.7090e-02],\n         [ 2....9062e-02,  1.5488e-03,  ..., -5.5847e-03,\n          -4.2725e-02, -2.6611e-02]]], device='cuda:0', dtype=torch.bfloat16))
E                +    where <built-in method equal of type object at 0x7fd2cde0e3a0> = torch.equal

tests/test_flash_attn.py:785: AssertionError
_____________________________________________________________________________________________ test_flash_attn_race_condition[0.0-1025-16-False-dtype0] ______________________________________________________________________________________________

seqlen = 1025, d = 16, dropout_p = 0.0, causal = False, dtype = torch.float16

    @pytest.mark.parametrize('dtype', ([torch.float16] if is_sm75 else [torch.float16, torch.bfloat16]))
    # @pytest.mark.parametrize('dtype', [torch.float16])
    @pytest.mark.parametrize('causal', [False, True])
    @pytest.mark.parametrize('d', [128, 64, 80, 40, 32, 16])
    # @pytest.mark.parametrize('d', [64])
    @pytest.mark.parametrize('seqlen', [97, 128, 200, 256, 257, 384, 512, 768, 1024, 1025, 2048])
    # @pytest.mark.parametrize('seqlen', [128])
    @pytest.mark.parametrize('dropout_p', [0.0, 0.17])
    # @pytest.mark.parametrize('dropout_p', [0.0])
    def test_flash_attn_race_condition(seqlen, d, dropout_p, causal, dtype):
        if seqlen >= 2048 and torch.cuda.get_device_properties('cuda').total_memory <= 16 * 2**30:
            pytest.skip()  # Reference implementation OOM
        device = 'cuda'
        # set seed
        torch.random.manual_seed(0)
        batch_size = 32
        nheads = 4
        x = torch.randn(batch_size, seqlen, nheads * d, device=device, dtype=dtype, requires_grad=True)
        Wqkv = torch.nn.Linear(nheads * d, 3 * nheads * d, device=device, dtype=dtype)
    
        query_padding_mask = generate_random_padding_mask(seqlen, batch_size, device, mode='random')
        key_padding_mask = generate_random_padding_mask(seqlen, batch_size, device, mode='random')
    
        (q_unpad, k_unpad, v_unpad, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k, q, k, v,
         output_pad_fn, dq_pad_fn, dk_pad_fn) = generate_qkv(
             x, Wqkv, nheads, query_padding_mask, key_padding_mask
         )
    
        torch.random.manual_seed(0)
        output_unpad_0, sm_lse_0, S_dmask_0 = flash_attn_unpadded_func(
            q_unpad, k_unpad, v_unpad, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k,
            dropout_p, return_attn_probs=True, causal=causal
        )
        S_dmask_converted_0 = convert_flash_attn_S_to_softmax(
            S_dmask_0, query_padding_mask, key_padding_mask, d, dropout_p > 0.0, causal=causal
        )
    
        if is_sm80 or d <= 64:  # Only run backward for d=128 on A100
            g = torch.randn_like(output_unpad_0)
            dq_unpad_0, dk_unpad_0, dv_unpad_0, = torch.autograd.grad(output_unpad_0,
                                                                      (q_unpad, k_unpad, v_unpad), g)
    
        for _ in range(10):
            torch.random.manual_seed(0)
            output_unpad, sm_lse, S_dmask = flash_attn_unpadded_func(
                q_unpad, k_unpad, v_unpad, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k,
                dropout_p, return_attn_probs=True, causal=causal
            )
            S_dmask_converted = convert_flash_attn_S_to_softmax(
                S_dmask, query_padding_mask, key_padding_mask, d, dropout_p > 0.0, causal=causal
            )
            assert torch.equal(output_unpad, output_unpad_0)
            # sm_lse has some parts that are uninitialized from torch.empty
            # assert torch.equal(sm_lse, sm_lse_0)
            assert torch.equal(S_dmask_converted, S_dmask_converted_0)
    
            if is_sm80 or d <= 64:  # Only run backward for d=128 on A100
                dq_unpad, dk_unpad, dv_unpad, = torch.autograd.grad(output_unpad,
                                                                    (q_unpad, k_unpad, v_unpad), g)
>               assert torch.equal(dq_unpad, dq_unpad_0)
E               AssertionError: assert False
E                +  where False = <built-in method equal of type object at 0x7fd2cde0e3a0>(tensor([[[ 0.0785,  0.0568, -0.0504,  ..., -0.0021, -0.0053,  0.0227],\n         [ 0.0676, -0.0684,  0.0598,  ...,  0.0...,\n         [-0.0923, -0.0714,  0.0667,  ...,  0.0260, -0.0945,  0.0140]]],\n       device='cuda:0', dtype=torch.float16), tensor([[[ 0.0785,  0.0568, -0.0504,  ..., -0.0021, -0.0053,  0.0227],\n         [ 0.0676, -0.0684,  0.0598,  ...,  0.0...,\n         [-0.0923, -0.0714,  0.0667,  ...,  0.0260, -0.0945,  0.0140]]],\n       device='cuda:0', dtype=torch.float16))
E                +    where <built-in method equal of type object at 0x7fd2cde0e3a0> = torch.equal

tests/test_flash_attn.py:785: AssertionError
_____________________________________________________________________________________________ test_flash_attn_race_condition[0.17-1025-32-False-dtype0] _____________________________________________________________________________________________

seqlen = 1025, d = 32, dropout_p = 0.17, causal = False, dtype = torch.float16

    @pytest.mark.parametrize('dtype', ([torch.float16] if is_sm75 else [torch.float16, torch.bfloat16]))
    # @pytest.mark.parametrize('dtype', [torch.float16])
    @pytest.mark.parametrize('causal', [False, True])
    @pytest.mark.parametrize('d', [128, 64, 80, 40, 32, 16])
    # @pytest.mark.parametrize('d', [64])
    @pytest.mark.parametrize('seqlen', [97, 128, 200, 256, 257, 384, 512, 768, 1024, 1025, 2048])
    # @pytest.mark.parametrize('seqlen', [128])
    @pytest.mark.parametrize('dropout_p', [0.0, 0.17])
    # @pytest.mark.parametrize('dropout_p', [0.0])
    def test_flash_attn_race_condition(seqlen, d, dropout_p, causal, dtype):
        if seqlen >= 2048 and torch.cuda.get_device_properties('cuda').total_memory <= 16 * 2**30:
            pytest.skip()  # Reference implementation OOM
        device = 'cuda'
        # set seed
        torch.random.manual_seed(0)
        batch_size = 32
        nheads = 4
        x = torch.randn(batch_size, seqlen, nheads * d, device=device, dtype=dtype, requires_grad=True)
        Wqkv = torch.nn.Linear(nheads * d, 3 * nheads * d, device=device, dtype=dtype)
    
        query_padding_mask = generate_random_padding_mask(seqlen, batch_size, device, mode='random')
        key_padding_mask = generate_random_padding_mask(seqlen, batch_size, device, mode='random')
    
        (q_unpad, k_unpad, v_unpad, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k, q, k, v,
         output_pad_fn, dq_pad_fn, dk_pad_fn) = generate_qkv(
             x, Wqkv, nheads, query_padding_mask, key_padding_mask
         )
    
        torch.random.manual_seed(0)
        output_unpad_0, sm_lse_0, S_dmask_0 = flash_attn_unpadded_func(
            q_unpad, k_unpad, v_unpad, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k,
            dropout_p, return_attn_probs=True, causal=causal
        )
        S_dmask_converted_0 = convert_flash_attn_S_to_softmax(
            S_dmask_0, query_padding_mask, key_padding_mask, d, dropout_p > 0.0, causal=causal
        )
    
        if is_sm80 or d <= 64:  # Only run backward for d=128 on A100
            g = torch.randn_like(output_unpad_0)
            dq_unpad_0, dk_unpad_0, dv_unpad_0, = torch.autograd.grad(output_unpad_0,
                                                                      (q_unpad, k_unpad, v_unpad), g)
    
        for _ in range(10):
            torch.random.manual_seed(0)
            output_unpad, sm_lse, S_dmask = flash_attn_unpadded_func(
                q_unpad, k_unpad, v_unpad, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k,
                dropout_p, return_attn_probs=True, causal=causal
            )
            S_dmask_converted = convert_flash_attn_S_to_softmax(
                S_dmask, query_padding_mask, key_padding_mask, d, dropout_p > 0.0, causal=causal
            )
            assert torch.equal(output_unpad, output_unpad_0)
            # sm_lse has some parts that are uninitialized from torch.empty
            # assert torch.equal(sm_lse, sm_lse_0)
            assert torch.equal(S_dmask_converted, S_dmask_converted_0)
    
            if is_sm80 or d <= 64:  # Only run backward for d=128 on A100
                dq_unpad, dk_unpad, dv_unpad, = torch.autograd.grad(output_unpad,
                                                                    (q_unpad, k_unpad, v_unpad), g)
>               assert torch.equal(dq_unpad, dq_unpad_0)
E               AssertionError: assert False
E                +  where False = <built-in method equal of type object at 0x7fd2cde0e3a0>(tensor([[[-0.0155, -0.0330, -0.0320,  ..., -0.0228,  0.0065,  0.0318],\n         [-0.0445, -0.0417, -0.0245,  ...,  0.0...,\n         [ 0.0178,  0.0198,  0.0330,  ..., -0.0251, -0.0323,  0.0094]]],\n       device='cuda:0', dtype=torch.float16), tensor([[[-0.0155, -0.0330, -0.0320,  ..., -0.0228,  0.0065,  0.0318],\n         [-0.0445, -0.0417, -0.0245,  ...,  0.0...,\n         [ 0.0178,  0.0198,  0.0330,  ..., -0.0251, -0.0323,  0.0094]]],\n       device='cuda:0', dtype=torch.float16))
E                +    where <built-in method equal of type object at 0x7fd2cde0e3a0> = torch.equal

tests/test_flash_attn.py:785: AssertionError
============================================================================================================== short test summary info ==============================================================================================================
FAILED tests/test_flash_attn.py::test_flash_attn_race_condition[0.0-1025-32-False-dtype1] - AssertionError: assert False
FAILED tests/test_flash_attn.py::test_flash_attn_race_condition[0.0-1025-16-False-dtype0] - AssertionError: assert False
FAILED tests/test_flash_attn.py::test_flash_attn_race_condition[0.17-1025-32-False-dtype0] - AssertionError: assert False
3 failed, 2157 passed, 2801 skipped in 94.92s (0:01:34)

Issue Analytics

  • State: closed
  • Created: 10 months ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

3 reactions
tridao commented, Nov 13, 2022

> Is there a reason for not providing the full transformer implementation? I have a minimal version based on MinGPT that I can submit a PR for?

I’m planning to release an optimized implementation of GPT (probably on this repo) this coming week. It’ll have FlashAttention and a bunch of other optimizations.

0 reactions
skaae commented, Nov 13, 2022

Thanks.

Looking forward to the transformer release!
