[BUG] Illegal memory access CUDA error when using long sequences
Describe the bug
Running a forward pass on a DeepSpeedTransformerInference
layer, with a sequence length of ~1000 tokens, results in an illegal memory access CUDA error.
To Reproduce
Here is a minimal reproducible example that shows the bug:
from deepspeed.ops.transformer import DeepSpeedInferenceConfig, DeepSpeedTransformerInference
import torch
torch.cuda.set_device(0)
hidden_size = 256
heads = 8
num_layers = 12
fp16 = True
layernorm_epsilon = 1e-5
deepspeed_config = DeepSpeedInferenceConfig(
    hidden_size=hidden_size,
    intermediate_size=hidden_size * 4,
    heads=heads,
    num_hidden_layers=num_layers,
    layer_norm_eps=layernorm_epsilon,
    # encoder_decoder=False,
    fp16=fp16,
    pre_layer_norm=True,
    stochastic_mode=False,
    scale_attention=True,
    triangular_masking=True,
    local_attention=False,
    window_size=256,
)
transformer = DeepSpeedTransformerInference(config=deepspeed_config)
transformer.half()
new_state_dict = {k: 0.01 * torch.ones(*v.shape, dtype=v.dtype, device=v.device)
                  for k, v in transformer.state_dict().items()}
transformer.load_state_dict(new_state_dict)
transformer.cuda()
device = list(transformer.parameters())[0].device
batch_size = 1
seq_len = 1000
inputs = torch.ones((batch_size, seq_len, hidden_size), dtype=torch.float16, device=device)
input_mask = torch.ones(*inputs.shape[:2], dtype=bool, device=device)
output, _ = transformer(
    input=inputs,
    input_mask=input_mask)
print(f"output: \n {output}")
Running the code results in the following exception:
RuntimeError: CUDA error: an illegal memory access was encountered
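(A side note I'm adding here, not part of the original report: this kind of CUDA error is raised asynchronously, so the Python traceback may point at an unrelated later call. Forcing synchronous launches before CUDA is initialized makes the error surface at the exact forward call, which may help whoever debugs this.)

# Hypothetical debugging setup, not from the original issue: setting
# CUDA_LAUNCH_BLOCKING before any CUDA work makes kernel launches
# synchronous, so the illegal access is reported at the offending call.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialized

# ... then run the reproduction script above unchanged; the traceback
# should now stop at the transformer(...) forward pass.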
Expected behavior
I expected to get a correct output, without the exception.
ds_report output
[2022-06-28 10:35:33,425] [WARNING] [partition_parameters.py:60:<module>] unable to find torch.distributed._all_gather_base. will fall back to torch.distributed.all_gather which will result in suboptimal performance. please consider upgrading your pytorch installation.
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] using untested triton version (1.1.1), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.8/site-packages/torch']
torch version .................... 1.8.0a0+1606899
torch cuda version ............... 11.1
torch hip version ................ None
nvcc version ..................... 11.1
deepspeed install path ........... ['/opt/conda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.6.5, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.8, cuda 11.1
System info (please complete the following information):
- OS: Ubuntu 20.04
- GPU count and types: a single A100 GPU
- Python version: 3.8.5
Launcher context
Launching directly using the Python interpreter.
Additional context
Maybe the bug is related to line 20 in csrc/transformer/inference/includes/custom_cuda_layers.h? It reads:
#define MAX_OUT_TOKES 1024
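One way to probe that hypothesis (a sketch I'm adding here, not something done in the original issue) is to rerun the layer from the reproduction script with sequence lengths on both sides of that limit and see where the error first appears:

# Hypothetical sweep around the suspected 1024-token limit; it reuses
# `transformer`, `hidden_size`, and `device` from the repro script above.
import torch

for seq_len in (512, 768, 1000, 1024, 1056):
    inputs = torch.ones((1, seq_len, hidden_size), dtype=torch.float16, device=device)
    input_mask = torch.ones(1, seq_len, dtype=torch.bool, device=device)
    try:
        transformer(input=inputs, input_mask=input_mask)
        torch.cuda.synchronize()  # surface asynchronous CUDA errors here
        print(f"seq_len={seq_len}: OK")
    except RuntimeError as err:
        print(f"seq_len={seq_len}: {err}")
        break  # the CUDA context is unusable after an illegal access

If failures only started above 1024 tokens, the #define would be a plausible culprit; since the report already fails at ~1000 tokens, the limit alone may not explain it.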
Top GitHub Comments
Hi @tomeras91
Thanks for reporting this issue. I will look into this. @mrwyattii, thanks for reproducing this. Yes, I think the issue is probably somewhere else. Thanks, Reza
@tomeras91 I can confirm that I'm able to reproduce this error. I don't think it has anything to do with MAX_OUT_TOKES. @RezaYazdaniAminabadi could you take a look at this?