[BUG] Sparse attention results in map::at triton error or CUDA: Error- invalid image triton error
Describe the bug
I was unable to get sparse attention to run on my machine. When using sparse attention, Triton throws either IndexError: map::at (under Triton 1.0.0) or RuntimeError: CUDA: Error- invalid image (under the latest version, Triton 1.1.1). This is related to https://github.com/EleutherAI/gpt-neox/issues/472, with the difference that I’ve boiled the issue down to a very short, self-contained example here.
To Reproduce
I saw the below note at https://www.deepspeed.ai/tutorials/sparse-attention/, which I used as a reference to set up my environment:
Note: Currently DeepSpeed Sparse Attention can be used only on NVIDIA V100 GPU using Torch >= 1.5 and Cuda 10.1 or 10.2.
I created and attached to a CUDA 10.2 docker container:
docker run --gpus all -ti -d --name deepspeed -v ~/:/home nvidia/cuda:10.2-devel-ubuntu18.04
docker attach deepspeed
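(As a quick sanity check at this point, nvidia-smi inside the container confirms the GPU is visible; its full output is included under System info below.)
nvidia-smi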
Then I installed basic dependencies:
apt update
apt install python3.8 python3.8-dev python3.8-venv python3-venv libopenmpi-dev
I created and activated a new Python 3.8 virtual environment:
python3.8 -m venv ~/.virtualenvs/deepspeedtest
. ~/.virtualenvs/deepspeedtest/bin/activate
python -m pip install --upgrade pip wheel
Then I installed PyTorch 1.5 for CUDA 10.2, followed by the latest version of DeepSpeed:
pip install torch==1.5.0 torchvision==0.6.0
pip install deepspeed
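DeepSpeed ships a ds_report utility, which is useful here to confirm that the sparse_attn op is reported as compatible before writing any code (its full output is included in the ds_report output section below):
ds_report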
I wrote this simple script to test sparse self attention:
sparse_attention_test.py:
import deepspeed
import torch

# Fixed sparsity pattern for 2 attention heads, with causal (unidirectional) masking
sparse_self_attention = deepspeed.ops.sparse_attention.SparseSelfAttention(
    sparsity_config=deepspeed.ops.sparse_attention.FixedSparsityConfig(
        2,  # num_heads
        attention="unidirectional"
    )
)

# Inputs are fp16 CUDA tensors of shape (batch_size, num_heads, seq_length, head_size)
query = torch.rand((4, 2, 128, 512)).to(torch.float16).to("cuda")
key = torch.rand((4, 2, 128, 512)).to(torch.float16).to("cuda")
value = torch.rand((4, 2, 128, 512)).to(torch.float16).to("cuda")

context = sparse_self_attention(query, key, value)
print(context)
(I wasn’t entirely sure what tensor shape the method expected, but after adding some print statements to the Bing BERT example mentioned in the sparse attention tutorial, it appears the expected input shape is (batch_size, num_heads, seq_length, head_size).)
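For clarity, here is the same tensor construction with each dimension named explicitly (the variable names are just my own labels for that shape, not anything from the DeepSpeed API):

import torch

# Equivalent to the torch.rand(...).to(torch.float16).to("cuda") calls above
batch_size, num_heads, seq_length, head_size = 4, 2, 128, 512
query = torch.rand((batch_size, num_heads, seq_length, head_size), dtype=torch.float16, device="cuda")
key = torch.rand((batch_size, num_heads, seq_length, head_size), dtype=torch.float16, device="cuda")
value = torch.rand((batch_size, num_heads, seq_length, head_size), dtype=torch.float16, device="cuda")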
I ran the script directly with python:
python sparse_attention_test.py
Then I got a ModuleNotFoundError: No module named 'triton' error, so I installed the latest version of Triton:
pip install triton
I reran the script and got the following error:
AttributeError: module 'torch' has no attribute 'is_autocast_enabled'
The is_autocast_enabled method doesn’t appear to have been added to PyTorch until version 1.6, so I made a new Python virtual environment as before, with the only change being that I installed PyTorch 1.6 instead of 1.5:
pip install torch==1.6.0 torchvision==0.7.0
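(For reference, it’s easy to confirm that a given PyTorch build actually has this API; this one-liner is my own addition, not something from DeepSpeed:)
python -c "import torch; print(torch.__version__, hasattr(torch, 'is_autocast_enabled'))"
# prints e.g. "1.5.0 False" or "1.6.0 True"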
Rerunning the code, I got a CUDA: Error- invalid image error from Triton.
Full traceback
(deepspeedtest_torch16) root@0eb5ee8aea44:/home/sa_test# python test.py
/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/deepspeed/ops/sparse_attention/matmul.py:460: UserWarning: This overload of nonzero is deprecated:
	nonzero()
Consider using one of the following signatures instead:
	nonzero(*, bool as_tuple) (Triggered internally at /pytorch/torch/csrc/utils/python_arg_parser.cpp:766.)
  nnz = layout.nonzero()
Traceback (most recent call last):
  File "test.py", line 15, in <module>
    context = sparse_self_attention(query, key, value)
  File "/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/deepspeed/ops/sparse_attention/sparse_self_attention.py", line 153, in forward
    attn_output_weights = sparse_dot_sdd_nt(query, key)
  File "/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/deepspeed/ops/sparse_attention/matmul.py", line 911, in __call__
    c = _sparse_matmul.apply(a,
  File "/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/deepspeed/ops/sparse_attention/matmul.py", line 701, in forward
    c = _sparse_matmul.fn[mode](a,
  File "/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/deepspeed/ops/sparse_attention/matmul.py", line 395, in _sdd_matmul
    _kernel[grid](a,
  File "/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/triton/code_gen.py", line 676, in __call__
    return self.kernel(*wargs, **kwargs, grid=self.grid)
  File "/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/triton/code_gen.py", line 644, in __call__
    binary = self._compile(
  File "/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/triton/code_gen.py", line 563, in _compile
    name, asm, shared_mem = _triton.code_gen.compile_ttir(backend, generator.module, device, num_warps, num_stages, force_nc_cache)
RuntimeError: CUDA: Error- invalid image
I thought the issue might lie with the Triton version, so I downgraded Triton to version 1.0.0:
pip install triton==1.0.0
Rerunning the script, I got an IndexError: map::at error from Triton. This is the same error referenced by https://github.com/EleutherAI/gpt-neox/issues/472.
Full traceback
/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/deepspeed/ops/sparse_attention/matmul.py:460: UserWarning: This overload of nonzero is deprecated:
	nonzero()
Consider using one of the following signatures instead:
	nonzero(*, bool as_tuple) (Triggered internally at /pytorch/torch/csrc/utils/python_arg_parser.cpp:766.)
  nnz = layout.nonzero()
Traceback (most recent call last):
  File "test.py", line 15, in <module>
    context = sparse_self_attention(query, key, value)
  File "/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/deepspeed/ops/sparse_attention/sparse_self_attention.py", line 153, in forward
    attn_output_weights = sparse_dot_sdd_nt(query, key)
  File "/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/deepspeed/ops/sparse_attention/matmul.py", line 911, in __call__
    c = _sparse_matmul.apply(a,
  File "/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/deepspeed/ops/sparse_attention/matmul.py", line 701, in forward
    c = _sparse_matmul.fn[mode](a,
  File "/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/deepspeed/ops/sparse_attention/matmul.py", line 395, in _sdd_matmul
    _kernel[grid](a,
  File "/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/triton/code_gen.py", line 599, in __call__
    return self.kernel(*wargs, **kwargs, grid=self.grid)
  File "/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/triton/code_gen.py", line 576, in __call__
    cache[key] = self._compile(
  File "/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/triton/code_gen.py", line 550, in _compile
    mod, ker, shared_mem, ir_asm = _triton.code_gen.add_passes_to_emit_bin(generator.module, tt_device, num_warps, num_stages, force_nc_cache)
IndexError: map::at
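For anyone trying to reproduce either failure, the exact PyTorch and Triton versions in a given environment can be double-checked with a one-liner (my own addition; both packages expose __version__ as far as I know):
python -c "import torch, triton; print(torch.__version__, triton.__version__)"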
I went on to try numerous permutations of PyTorch, DeepSpeed, CUDA, and Triton versions, all of which errored out.
Expected behavior
The SparseSelfAttention class’s forward method should return the context layer.
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/torch']
torch version .................... 1.6.0
torch cuda version ............... 10.2
nvcc version ..................... 10.2
deepspeed install path ........... ['/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.5.7, unknown, unknown
deepspeed wheel compiled w. ...... torch 0.0, cuda 0.0
Screenshots
N/A
System info (please complete the following information):
- OS: Ubuntu 18.04.6 LTS (host), nvidia/cuda:10.2-devel-ubuntu18.04 (Docker container)
- GPU count and types: One machine with 1x Tesla V100-SXM2-16GB
- Interconnects (if applicable): N/A
- Python version: 3.8.0
- Any other relevant info about your setup: See below
NVCC 10.2
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89
Nvidia driver 495.44
Mon Nov 29 01:18:10 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 495.44       CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0    38W / 300W |      0MiB / 16160MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
gcc 7.5.0
gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
pip freeze
deepspeed==0.5.7
filelock==3.4.0
future==0.18.2
hjson==3.0.2
ninja==1.10.2.3
numpy==1.21.4
packaging==21.3
Pillow==8.4.0
pkg_resources==0.0.0
psutil==5.8.0
pyparsing==3.0.6
torch==1.6.0
torchvision==0.7.0
tqdm==4.62.3
triton==1.1.1
Launcher context
I am not launching my experiment with the deepspeed launcher, MPI, or anything else; I run the script directly with python.
Docker context
nvidia/cuda:10.2-devel-ubuntu18.04
Additional context
None
Top GitHub Comments
Just saw that https://www.deepspeed.ai/ is GitHub Pages and part of this repo. I can submit a PR that updates both requirements-sparse_attn.txt and the sparse attention tutorial’s note, if those changes sound good.
Edit: I just successfully ran the sparse attention test on an A100 under both CUDA 11.0 and 11.1, so I can add that to the note as well.
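For concreteness, what I have in mind for requirements-sparse_attn.txt is simply pinning Triton to a known-good release, along the lines of the following (exact version to be confirmed):
triton==1.0.0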
Excellent! Yes please, if you submit a PR for both reqs file and docs that would be greatly appreciated 😃
We’ll have to dig into why newer Triton doesn’t work, but I’m glad that the older version still works. We’ll investigate on our side, but will probably have to ping @ptillet.