[BUG] Sparse attention results in map::at triton error or CUDA: Error- invalid image triton error
Describe the bug
I was unable to get sparse attention to run on my machine. When using sparse attention, Triton throws either IndexError: map::at (under Triton 1.0.0) or RuntimeError: CUDA: Error- invalid image (under the latest version, Triton 1.1.1). This is related to https://github.com/EleutherAI/gpt-neox/issues/472, with the difference that I’ve boiled the issue down to a very short, self-contained example here.
To Reproduce
I saw the below note at https://www.deepspeed.ai/tutorials/sparse-attention/, which I used as a reference to set up my environment:
Note: Currently DeepSpeed Sparse Attention can be used only on NVIDIA V100 GPU using Torch >= 1.5 and Cuda 10.1 or 10.2.
I created and attached to a CUDA 10.2 docker container:
docker run --gpus all -ti -d --name deepspeed -v ~/:/home nvidia/cuda:10.2-devel-ubuntu18.04
docker attach deepspeed
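(As a quick sanity check at this point, nvidia-smi inside the container confirms the GPU is visible; its full output is included under System info below.)
nvidia-smi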
Then I installed basic dependencies:
apt update
apt install python3.8 python3.8-dev python3.8-venv python3-venv libopenmpi-dev
I created and activated a new Python 3.8 virtual environment:
python3.8 -m venv ~/.virtualenvs/deepspeedtest
. ~/.virtualenvs/deepspeedtest/bin/activate
python -m pip install --upgrade pip wheel
Then I installed PyTorch 1.5 for CUDA 10.2, followed by the latest version of DeepSpeed:
pip install torch==1.5.0 torchvision==0.6.0
pip install deepspeed
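DeepSpeed ships a ds_report utility, which is useful here to confirm that the sparse_attn op is reported as compatible before writing any code (its full output is included in the ds_report output section below):
ds_report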
I wrote this simple script to test sparse self attention:
sparse_attention_test.py:
import deepspeed
import torch

# Fixed sparsity pattern for 2 attention heads, with causal (unidirectional) masking
sparse_self_attention = deepspeed.ops.sparse_attention.SparseSelfAttention(
    sparsity_config=deepspeed.ops.sparse_attention.FixedSparsityConfig(
        2,  # num_heads
        attention="unidirectional"
    )
)

# Inputs are fp16 CUDA tensors of shape (batch_size, num_heads, seq_length, head_size)
query = torch.rand((4, 2, 128, 512)).to(torch.float16).to("cuda")
key = torch.rand((4, 2, 128, 512)).to(torch.float16).to("cuda")
value = torch.rand((4, 2, 128, 512)).to(torch.float16).to("cuda")

context = sparse_self_attention(query, key, value)
print(context)
(I wasn’t entirely sure what tensor shape the method expected, but after adding some print statements to the Bing BERT example mentioned in the sparse attention tutorial, it appears the expected input shape is (batch_size, num_heads, seq_length, head_size).)
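For clarity, here is the same tensor construction with each dimension named explicitly (the variable names are just my own labels for that shape, not anything from the DeepSpeed API):

import torch

# Equivalent to the torch.rand(...).to(torch.float16).to("cuda") calls above
batch_size, num_heads, seq_length, head_size = 4, 2, 128, 512
query = torch.rand((batch_size, num_heads, seq_length, head_size), dtype=torch.float16, device="cuda")
key = torch.rand((batch_size, num_heads, seq_length, head_size), dtype=torch.float16, device="cuda")
value = torch.rand((batch_size, num_heads, seq_length, head_size), dtype=torch.float16, device="cuda")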
I ran the script directly with python:
python sparse_attention_test.py
Then I got a ModuleNotFoundError: No module named 'triton' error, so I installed the latest version of Triton:
pip install triton
I reran the script and got the following error:
AttributeError: module 'torch' has no attribute 'is_autocast_enabled'
The is_autocast_enabled method doesn’t appear to have been added to PyTorch until version 1.6, so I made a new Python virtual environment as before, with the only change being that I installed PyTorch 1.6 instead of 1.5:
pip install torch==1.6.0 torchvision==0.7.0
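(For reference, it’s easy to confirm that a given PyTorch build actually has this API; this one-liner is my own addition, not something from DeepSpeed:)
python -c "import torch; print(torch.__version__, hasattr(torch, 'is_autocast_enabled'))"
# prints e.g. "1.5.0 False" or "1.6.0 True"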
Rerunning the code, I got a CUDA: Error- invalid image error from Triton.
Full traceback
(deepspeedtest_torch16) root@0eb5ee8aea44:/home/sa_test# python test.py
/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/deepspeed/ops/sparse_attention/matmul.py:460: UserWarning: This overload of nonzero is deprecated:
	nonzero()
Consider using one of the following signatures instead:
	nonzero(*, bool as_tuple) (Triggered internally at /pytorch/torch/csrc/utils/python_arg_parser.cpp:766.)
  nnz = layout.nonzero()
Traceback (most recent call last):
  File "test.py", line 15, in <module>
    context = sparse_self_attention(query, key, value)
  File "/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/deepspeed/ops/sparse_attention/sparse_self_attention.py", line 153, in forward
    attn_output_weights = sparse_dot_sdd_nt(query, key)
  File "/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/deepspeed/ops/sparse_attention/matmul.py", line 911, in __call__
    c = _sparse_matmul.apply(a,
  File "/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/deepspeed/ops/sparse_attention/matmul.py", line 701, in forward
    c = _sparse_matmul.fn[mode](a,
  File "/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/deepspeed/ops/sparse_attention/matmul.py", line 395, in _sdd_matmul
    _kernel[grid](a,
  File "/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/triton/code_gen.py", line 676, in __call__
    return self.kernel(*wargs, **kwargs, grid=self.grid)
  File "/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/triton/code_gen.py", line 644, in __call__
    binary = self._compile(
  File "/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/triton/code_gen.py", line 563, in _compile
    name, asm, shared_mem = _triton.code_gen.compile_ttir(backend, generator.module, device, num_warps, num_stages, force_nc_cache)
RuntimeError: CUDA: Error- invalid image
I thought the issue might lie with the Triton version, so I downgraded Triton to version 1.0.0:
pip install triton==1.0.0
Rerunning the script, I got an IndexError: map::at error from Triton. This is the same error referenced by https://github.com/EleutherAI/gpt-neox/issues/472.
Full traceback
/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/deepspeed/ops/sparse_attention/matmul.py:460: UserWarning: This overload of nonzero is deprecated:
	nonzero()
Consider using one of the following signatures instead:
	nonzero(*, bool as_tuple) (Triggered internally at /pytorch/torch/csrc/utils/python_arg_parser.cpp:766.)
  nnz = layout.nonzero()
Traceback (most recent call last):
  File "test.py", line 15, in <module>
    context = sparse_self_attention(query, key, value)
  File "/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/deepspeed/ops/sparse_attention/sparse_self_attention.py", line 153, in forward
    attn_output_weights = sparse_dot_sdd_nt(query, key)
  File "/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/deepspeed/ops/sparse_attention/matmul.py", line 911, in __call__
    c = _sparse_matmul.apply(a,
  File "/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/deepspeed/ops/sparse_attention/matmul.py", line 701, in forward
    c = _sparse_matmul.fn[mode](a,
  File "/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/deepspeed/ops/sparse_attention/matmul.py", line 395, in _sdd_matmul
    _kernel[grid](a,
  File "/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/triton/code_gen.py", line 599, in __call__
    return self.kernel(*wargs, **kwargs, grid=self.grid)
  File "/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/triton/code_gen.py", line 576, in __call__
    cache[key] = self._compile(
  File "/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/triton/code_gen.py", line 550, in _compile
    mod, ker, shared_mem, ir_asm = _triton.code_gen.add_passes_to_emit_bin(generator.module, tt_device, num_warps, num_stages, force_nc_cache)
IndexError: map::at
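For anyone trying to reproduce either failure, the exact PyTorch and Triton versions in a given environment can be double-checked with a one-liner (my own addition; both packages expose __version__ as far as I know):
python -c "import torch, triton; print(torch.__version__, triton.__version__)"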
I went on to try numerous permutations of PyTorch, DeepSpeed, CUDA, and Triton versions, all of which errored out.
Expected behavior
The SparseSelfAttention class’s forward method should return the context layer.
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/torch']
torch version .................... 1.6.0
torch cuda version ............... 10.2
nvcc version ..................... 10.2
deepspeed install path ........... ['/root/.virtualenvs/deepspeedtest_torch16/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.5.7, unknown, unknown
deepspeed wheel compiled w. ...... torch 0.0, cuda 0.0
Screenshots
N/A
System info (please complete the following information):
- OS: Ubuntu 18.04.6 LTS (host), nvidia/cuda:10.2-devel-ubuntu18.04 (Docker container)
- GPU count and types: One machine with 1x Tesla V100-SXM2-16GB
- Interconnects (if applicable): N/A
- Python version: 3.8.0
- Any other relevant info about your setup: See below
NVCC 10.2
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89
Nvidia driver 495.44
Mon Nov 29 01:18:10 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 495.44       CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0    38W / 300W |      0MiB / 16160MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
gcc 7.5.0
gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
pip freeze
deepspeed==0.5.7
filelock==3.4.0
future==0.18.2
hjson==3.0.2
ninja==1.10.2.3
numpy==1.21.4
packaging==21.3
Pillow==8.4.0
pkg_resources==0.0.0
psutil==5.8.0
pyparsing==3.0.6
torch==1.6.0
torchvision==0.7.0
tqdm==4.62.3
triton==1.1.1
Launcher context
I am not launching my experiment with the deepspeed launcher, MPI, or anything else; I run the script directly with python.
Docker context
nvidia/cuda:10.2-devel-ubuntu18.04
Additional context
None
Top GitHub Comments
Just saw that https://www.deepspeed.ai/ is GitHub Pages and part of this repo. I can submit a PR that updates both requirements-sparse_attn.txt and the sparse attention tutorial’s note, if those changes sound good.
Edit: I just successfully ran the sparse attention test on an A100 under both CUDA 11.0 and 11.1, so I can add that to the note as well.
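For concreteness, what I have in mind for requirements-sparse_attn.txt is simply pinning Triton to a known-good release, along the lines of the following (exact version to be confirmed):
triton==1.0.0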
Excellent! Yes please, if you submit a PR for both reqs file and docs that would be greatly appreciated 😃
We’ll have to dig into why newer Triton doesn’t work, but I’m glad that the older version still works. We’ll investigate on our side, but will probably have to ping @ptillet.