[BUG] Illegal memory access CUDA error when using long sequences
Describe the bug
Running a forward pass on a DeepSpeedTransformerInference
layer, with a sequence length of ~1000 tokens, results in an illegal memory access CUDA error.
To Reproduce
Here is a minimal reproducible example that shows the bug:
from deepspeed.ops.transformer import DeepSpeedInferenceConfig, DeepSpeedTransformerInference
import torch
torch.cuda.set_device(0)
hidden_size = 256
heads = 8
num_layers = 12
fp16 = True
layernorm_epsilon = 1e-5
deepspeed_config = DeepSpeedInferenceConfig(
    hidden_size=hidden_size,
    intermediate_size=hidden_size * 4,
    heads=heads,
    num_hidden_layers=num_layers,
    layer_norm_eps=layernorm_epsilon,
    # encoder_decoder=False,
    fp16=fp16,
    pre_layer_norm=True,
    stochastic_mode=False,
    scale_attention=True,
    triangular_masking=True,
    local_attention=False,
    window_size=256,
)
transformer = DeepSpeedTransformerInference(config=deepspeed_config)
transformer.half()
new_state_dict = {k: 0.01 * torch.ones(*v.shape, dtype=v.dtype, device=v.device)
                  for k, v in transformer.state_dict().items()}
transformer.load_state_dict(new_state_dict)
transformer.cuda()
device = list(transformer.parameters())[0].device
batch_size = 1
seq_len = 1000
inputs = torch.ones((batch_size, seq_len, hidden_size), dtype=torch.float16, device=device)
input_mask = torch.ones(*inputs.shape[:2], dtype=bool, device=device)
output, _ = transformer(
    input=inputs,
    input_mask=input_mask)
print(f"output: \n {output}")
Running the code results in the following exception:
RuntimeError: CUDA error: an illegal memory access was encountered
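(A side note I'm adding here, not part of the original report: this kind of CUDA error is raised asynchronously, so the Python traceback may point at an unrelated later call. Forcing synchronous launches before CUDA is initialized makes the error surface at the exact forward call, which may help whoever debugs this.)

# Hypothetical debugging setup, not from the original issue: setting
# CUDA_LAUNCH_BLOCKING before any CUDA work makes kernel launches
# synchronous, so the illegal access is reported at the offending call.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialized

# ... then run the reproduction script above unchanged; the traceback
# should now stop at the transformer(...) forward pass.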
Expected behavior
I expected to get a correct output, without the exception.
ds_report output
[2022-06-28 10:35:33,425] [WARNING] [partition_parameters.py:60:<module>] unable to find torch.distributed._all_gather_base. will fall back to torch.distributed.all_gather which will result in suboptimal performance. please consider upgrading your pytorch installation.
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] using untested triton version (1.1.1), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.8/site-packages/torch']
torch version .................... 1.8.0a0+1606899
torch cuda version ............... 11.1
torch hip version ................ None
nvcc version ..................... 11.1
deepspeed install path ........... ['/opt/conda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.6.5, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.8, cuda 11.1
System info (please complete the following information):
- OS: Ubuntu 20.04
- GPU count and types: a single A100 GPU
- Python version: 3.8.5
Launcher context
Launching directly using the Python interpreter.
Additional context
Maybe the bug is related to line 20 in csrc/transformer/inference/includes/custom_cuda_layers.h? It reads:
#define MAX_OUT_TOKES 1024
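One way to probe that hypothesis (a sketch I'm adding here, not something done in the original issue) is to rerun the layer from the reproduction script with sequence lengths on both sides of that limit and see where the error first appears:

# Hypothetical sweep around the suspected 1024-token limit; it reuses
# `transformer`, `hidden_size`, and `device` from the repro script above.
import torch

for seq_len in (512, 768, 1000, 1024, 1056):
    inputs = torch.ones((1, seq_len, hidden_size), dtype=torch.float16, device=device)
    input_mask = torch.ones(1, seq_len, dtype=torch.bool, device=device)
    try:
        transformer(input=inputs, input_mask=input_mask)
        torch.cuda.synchronize()  # surface asynchronous CUDA errors here
        print(f"seq_len={seq_len}: OK")
    except RuntimeError as err:
        print(f"seq_len={seq_len}: {err}")
        break  # the CUDA context is unusable after an illegal access

If failures only started above 1024 tokens, the #define would be a plausible culprit; since the report already fails at ~1000 tokens, the limit alone may not explain it.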
Top GitHub Comments
Hi @tomeras91
Thanks for reporting this issue. I will look into this. @mrwyattii, thanks for reproducing this. Yes, I think the issue is probably somewhere else. Thanks, Reza
@tomeras91 I can confirm that I'm able to reproduce this error. I don't think it has anything to do with MAX_OUT_TOKES. @RezaYazdaniAminabadi could you take a look at this?