
[BUG] CUDA error with INT8 inference


Describe the bug
I am trying to get started with implementing INT8 inference on DeepSpeed, but I am running into RuntimeError: CUDA error: an illegal memory access was encountered.

To Reproduce

Code:

I am interested in implementing INT8 inference with GPT2-style models; the code I am running is the following:

import os
import torch
import deepspeed
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Rank and world size are set by the deepspeed launcher.
local_rank = int(os.getenv('LOCAL_RANK', '-1'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
model = GPT2LMHeadModel.from_pretrained('gpt2-large').to(local_rank)

# Wrap the model with DeepSpeed's inference engine, requesting INT8 kernels.
model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.int8,
    replace_method='auto',
    quantization_setting=2,
    replace_with_kernel_inject=True
)

input_ids = tokenizer.encode('Example context for testing ', return_tensors='pt')
input_ids = input_ids.to(local_rank)
outputs = model(input_ids)
print(outputs)

I am running this with:

  • deepspeed version 0.5.9
  • CUDA version 11.3
  • torch version 1.9
  • transformers version 4.16.2
  • Python version 3.8.5
  • NVIDIA Tesla V100 GPUs
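
To rule out environment mismatches (the ds_report output below shows torch 1.10.2+cu113, while the DeepSpeed wheel was compiled against torch 1.9 / cuda 11.1), the installed versions can be printed directly; a minimal check using the standard public version attributes:

import torch
import deepspeed
import transformers

# Versions that matter when DeepSpeed JIT-compiles its CUDA ops.
print('torch       :', torch.__version__)
print('torch cuda  :', torch.version.cuda)
print('deepspeed   :', deepspeed.__version__)
print('transformers:', transformers.__version__)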

I noticed a few bugs blocking INT8 inference, and I made the following changes to the source code:

  1. Around line 132 in deepspeed/runtime/weight_quantizer.py, where is_mlp was not defined:

            for key in range(len(keys)):
                #if self.mlp_extra_grouping and is_mlp(keys[key]): # line removed
                if self.mlp_extra_grouping and self.is_mlp(keys[key]) >= 2: # line added

  2. Around line 161 in deepspeed/runtime/weight_quantizer.py (see the note after this list):

        else:
            for plcy in replace_policies:
                _ = plcy(None) # line added
                policy.update({plcy._orig_layer_class: (quantize_fn, plcy)})
  3. At line 282 in deepspeed/ops/transformer/inference/transformer_inference.py:

            # context_layer, key_layer, value_layer = compute_attention(qkv_out) # line removed
            context_layer, key_layer, value_layer = compute_attention(qkv_out, input_mask) # line added
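
A note on change 2: the plcy(None) call appears to be needed because some replace policies only populate _orig_layer_class inside their constructor (e.g. after importing transformers), so the attribute is still None when the policy class is inspected. A minimal sketch of that lazy-initialization pattern (illustrative names, not DeepSpeed's actual policy code):

class ExamplePolicy:
    _orig_layer_class = None  # populated lazily, on first construction

    def __init__(self, layer):
        import transformers
        # Record which HF layer class this policy replaces.
        ExamplePolicy._orig_layer_class = (
            transformers.models.gpt2.modeling_gpt2.GPT2Block)
        self.layer = layer

# Before any instance exists, the attribute is still None, so a policy
# dict keyed on it would map None to the policy:
assert ExamplePolicy._orig_layer_class is None
_ = ExamplePolicy(None)  # mirrors the added plcy(None) line
assert ExamplePolicy._orig_layer_class is not None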

Expected behavior
Output meaningful logits.

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/ec2-user/anaconda3/envs/pretrain-vector/lib/python3.8/site-packages/torch']
torch version .................... 1.10.2+cu113
torch cuda version ............... 11.3
nvcc version ..................... 11.3
deepspeed install path ........... ['/home/ec2-user/anaconda3/envs/pretrain-vector/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.5.9, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.9, cuda 11.1

Outputs
While the code runs error-free with dtype=torch.float and dtype=torch.half, I am running into errors with dtype=torch.int8.
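
For reference, the working half-precision variant differs only in the dtype passed to init_inference (a minimal sketch reusing model and world_size from the script above; the quantization_setting argument is presumably dropped for fp16):

model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.half,  # torch.float also runs error-free; torch.int8 crashes
    replace_method='auto',
    replace_with_kernel_inject=True
)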

Running CUDA_VISIBLE_DEVICES=1 CUDA_LAUNCH_BLOCKING=1 deepspeed gpt_example.py results in the following output:

[2022-02-23 16:00:20,231] [WARNING] [runner.py:132:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=1: setting --include=localhost:1
[2022-02-23 16:00:20,308] [INFO] [runner.py:398:main] cmd = /home/ec2-user/anaconda3/envs/pretrain-vector/bin/python3.8 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMV19 --master_addr=127.0.0.1 --master_port=29500 gpt_example.py
[2022-02-23 16:00:21,307] [INFO] [launch.py:80:main] WORLD INFO DICT: {'localhost': [1]}
[2022-02-23 16:00:21,308] [INFO] [launch.py:86:main] nnodes=1, num_local_procs=1, node_rank=0
[2022-02-23 16:00:21,308] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2022-02-23 16:00:21,308] [INFO] [launch.py:100:main] dist_world_size=1
[2022-02-23 16:00:21,308] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=1
[2022-02-23 16:00:35,424] [INFO] [logging.py:69:log_dist] [Rank -1] DeepSpeed info: version=0.5.9, git-hash=unknown, git-branch=unknown
[2022-02-23 16:00:35,424] [INFO] [engine.py:127:_init_quantization_setting] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 2
Using /home/ec2-user/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
/home/ec2-user/anaconda3/envs/pretrain-vector/lib/python3.8/site-packages/torch/utils/cpp_extension.py:295: UserWarning: 

                               !! WARNING !!

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Your compiler (c++) is not compatible with the compiler Pytorch was
built with for this platform, which is g++ on linux. Please
use g++ to to compile your extension. Alternatively, you may
compile PyTorch from source using c++, and then you can also use
c++ to compile your extension.

See https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md for help
with compiling PyTorch from source.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

                              !! WARNING !!

  warnings.warn(WRONG_COMPILER_WARNING.format(
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ec2-user/.cache/torch_extensions/py38_cu113/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.3160672187805176 seconds
DeepSpeed Transformer Inference config is  {'layer_id': 0, 'hidden_size': 1280, 'intermediate_size': 5120, 'heads': 20, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-12, 'mp_size': 1, 'q_int8': True, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'return_tuple': True}
(... the same "DeepSpeed Transformer Inference config" line repeats for layer_id 1 through 35 ...)
[2022-02-23 16:00:36,481] [INFO] [engine.py:91:__init__] Place model to device: 0

!!!! kernel execution error. (batch: 20, m: 64, n: 5, k: 5, error: 13) 
Traceback (most recent call last):
  File "DeepSpeedExperiment.py", line 24, in <module>
    outputs = model(input_ids)
  File "/home/ec2-user/anaconda3/envs/pretrain-vector/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pretrain-vector/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 246, in forward
    outputs = self.module(*inputs, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pretrain-vector/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1120, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pretrain-vector/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1047, in forward
    transformer_outputs = self.transformer(
  File "/home/ec2-user/anaconda3/envs/pretrain-vector/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pretrain-vector/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 890, in forward
    outputs = block(
  File "/home/ec2-user/anaconda3/envs/pretrain-vector/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pretrain-vector/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 587, in forward
    attention_output = self.attention(input,
  File "/home/ec2-user/anaconda3/envs/pretrain-vector/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pretrain-vector/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 355, in forward
    output = DeepSpeedSelfAttentionFunction.apply(
  File "/home/ec2-user/anaconda3/envs/pretrain-vector/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 291, in forward
    output, key_layer, value_layer, context_layer = selfAttention_int8()
  File "/home/ec2-user/anaconda3/envs/pretrain-vector/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 282, in selfAttention_int8
    context_layer, key_layer, value_layer = compute_attention(qkv_out, input_mask)
  File "/home/ec2-user/anaconda3/envs/pretrain-vector/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 236, in compute_attention
    context_layer = _transpose_for_context(context_layer)
  File "/home/ec2-user/anaconda3/envs/pretrain-vector/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 144, in _transpose_for_context
    x = x.permute(0, 2, 1, 3).contiguous()
RuntimeError: CUDA error: an illegal memory access was encountered
[2022-02-23 16:00:37,334] [INFO] [launch.py:131:sigkill_handler] Killing subprocess 52992
[2022-02-23 16:00:37,334] [ERROR] [launch.py:137:sigkill_handler] ['/home/ec2-user/anaconda3/envs/pretrain-vector/bin/python3.8', '-u', 'DeepSpeedExperiment.py', '--local_rank=0'] exits with return code = 1

Launcher context
Using the deepspeed launcher.

Top GitHub Comments

RezaYazdaniAminabadi commented, Mar 1, 2022 (1 reaction)

Sorry for the delay, @gsujankumar I will try to make it work by the end of this week or early next week.

gsujankumar commented, Mar 18, 2022 (0 reactions)

Sure, I will create a PR soon.

Yes, we are using the quantizer kernels from DeepSpeed. Can you check if there are any issues with groups?
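
Since quantization groups show up in the log above (quantize_groups = 2), here is a minimal, illustrative sketch of what group-wise symmetric INT8 quantization does — a toy standalone example, not DeepSpeed's actual quantizer kernel:

import torch

def quantize_groupwise(w: torch.Tensor, num_groups: int = 2):
    # Split the flattened weight into num_groups chunks, one scale each.
    groups = w.reshape(num_groups, -1)
    # Symmetric scaling: the largest magnitude in each group maps to 127.
    scales = groups.abs().max(dim=1, keepdim=True).values / 127.0
    q = torch.clamp(torch.round(groups / scales), -128, 127).to(torch.int8)
    return q, scales

w = torch.randn(1280, 5120)
q, scales = quantize_groupwise(w, num_groups=2)
# Round-trip to see the error the grouping produces; too few groups for a
# tensor with outliers is one way quantized outputs become meaningless.
w_hat = (q.float() * scales).reshape(w.shape)
print('max abs error:', (w - w_hat).abs().max().item())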
