GPT-Neo inference issues
Attempting to do inference using the new inference support for gpt-neo models, I encountered multiple issues. I used the following code on Google Colab for testing:
Setup:
!pip install transformers==4.6.1
!pip install git+https://github.com/microsoft/DeepSpeed # d2cf66a66847aa7f8d25da5708b7016e54f29e0a
!pip install mpi4py
!rm /usr/local/cuda
!ln -s /usr/local/cuda-10.1 /usr/local/cuda
Inference:
import deepspeed
from transformers import GPTNeoForCausalLM, AutoTokenizer
import torch
import time
#model_name = "EleutherAI/gpt-neo-2.7B"
model_name = "EleutherAI/gpt-neo-125M"
from transformers.file_utils import cached_path, WEIGHTS_NAME, hf_bucket_url
archive_file = hf_bucket_url(model_name, filename=WEIGHTS_NAME)
resolved_archive_file = cached_path(archive_file)
checkpoint = torch.load(resolved_archive_file, map_location="cuda:0")
for k in checkpoint.keys():
    checkpoint[k] = checkpoint[k].half()
model = GPTNeoForCausalLM.from_pretrained(model_name, state_dict=checkpoint).half().to("cuda").eval()
for k in list(checkpoint.keys()):
    del checkpoint[k]
del checkpoint
model = model.float()
tokenizer = AutoTokenizer.from_pretrained("gpt2")
torch.cuda.empty_cache()
ds_engine = deepspeed.init_inference(model,
                                     mp_size=1,
                                     dtype=torch.float,  #half,
                                     replace_method='auto')
model = ds_engine.module
ids = tokenizer.encode("A valley full of unicorns was discovered", return_tensors="pt").cuda()
outputs = model.generate(ids, use_cache=True, do_sample=True, min_length=1024, remove_invalid_values=True, max_length=1024, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0]))
None policy
Some issues occurred during loading. First issue:
[2021-05-27 10:10:05,407] [INFO] [logging.py:60:log_dist] [Rank -1] DeepSpeed info: version=0.4.0+d2cf66a, git-hash=d2cf66a, git-branch=master
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-2-80433536d61b> in <module>()
23 mp_size=1,
24 dtype=torch.float,#half,
---> 25 replace_method='auto')
26 model = ds_engine.module
27
4 frames
/usr/local/lib/python3.7/dist-packages/deepspeed/module_inject/replace_module.py in replace_module(model, orig_class, replace_fn, _replace_policy)
390 # instantiate a throw-away policy in order to populate the _orig_layer_class
391 _ = plcy(None)
--> 392 assert plcy._orig_layer_class != None
393 policy.update({plcy._orig_layer_class: (replace_fn, plcy)})
394
AssertionError:
I fixed this issue by changing lines 392-393:
if plcy._orig_layer_class is not None:
    policy.update({plcy._orig_layer_class: (replace_fn, plcy)})
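With that change applied, the relevant section of replace_module looks roughly like this (reconstructed from the traceback above, so the surrounding code is approximate):
# instantiate a throw-away policy in order to populate the _orig_layer_class
_ = plcy(None)
if plcy._orig_layer_class is not None:
    policy.update({plcy._orig_layer_class: (replace_fn, plcy)})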
Casting None to half precision
The next issue only occurs when loading the model with half precision:
[2021-05-27 10:12:47,942] [INFO] [logging.py:60:log_dist] [Rank -1] DeepSpeed info: version=0.4.0+d2cf66a, git-hash=d2cf66a, git-branch=master
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-1-fb492221c962> in <module>()
23 mp_size=1,
24 dtype=torch.half,
---> 25 replace_method='auto')
26 model = ds_engine.module
27
9 frames
/usr/local/lib/python3.7/dist-packages/deepspeed/module_inject/replace_module.py in replace_with_policy(child, policy_cls, inference, preln, layer_id)
156
157 if quantize or fp16:
--> 158 qkvb = qkvb.half()
159 dense_b = dense_b.half()
160 _h4h_b = _h4h_b.half()
AttributeError: 'NoneType' object has no attribute 'half'
This seems to be caused by policy.attention() returning None for qkvb. Either commenting the cast out or adding a check so it is only applied when qkvb is not None lets the model initialize, but generations look very broken.
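The guard I used looks roughly like this (variable names taken from the traceback above; only qkvb appears to be None for gpt-neo):
if quantize or fp16:
    # qkvb can be None for gpt-neo, so only cast it when it exists
    if qkvb is not None:
        qkvb = qkvb.half()
    dense_b = dense_b.half()
    _h4h_b = _h4h_b.half()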
Local attention not working correctly
Going back to fp32 inference, the model initializes correctly with the above modifications. However, after about 512 tokens (local attention window size) generation becomes very broken. The following is the output around the critical area:
print(tokenizer.decode(outputs[0, 480:580]))
, but they did this. They continued to have their own activities. They did not have any money – people didn’t go out to the mountain to pay their own way. They did not pay to the village, and it was a village which would not reach its goals.
Ital was that they did.
They did not.
The river did not go to the valley goes, they do. It
the river go, there
I have seen generations like this before, both while fixing a memory issue in the Huggingface local attention implementation and when loading gpt-neo models as gpt2, whenever something was wrong with the local attention mechanism. In gpt-neo, every second layer (or as otherwise defined by the config) uses local attention, where the causal mask is not a triangular mask but a sliding window of a given window size. Using the HF gpt2 implementation, this can be implemented with a bias matrix as follows:
max_positions = config.n_ctx
window_size = config.window_size
bias = torch.tril(torch.ones((max_positions, max_positions), dtype=torch.uint8)).view(1, 1, max_positions, max_positions).bool()
local_bias = bias ^ torch.tril(bias, -window_size)
for i in range(config.n_layer):
    if config.attention_layers[i] == "local":
        converted[f"transformer.h.{i}.attn.bias"] = local_bias
    elif config.attention_layers[i] == "global":
        converted[f"transformer.h.{i}.attn.bias"] = bias
Looking at the deepspeed code implementing gpt-neo, I found no handling of local attention layers and suspect they are treated as global attention, which leads to broken generations when the context becomes longer than the local attention window.
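For comparison, this is roughly how such a precomputed bias gets applied to the attention scores in the HF gpt2-style implementation (a simplified sketch with a hypothetical helper name, not DeepSpeed code); the local layers would need the banded local_bias here instead of the full causal bias:
import torch

def apply_attention_bias(scores, bias):
    # scores: (batch, heads, query_len, key_len) raw attention scores
    # bias: (1, 1, max_positions, max_positions) bool mask, either the global
    #       causal mask or the sliding-window local mask from above
    query_len, key_len = scores.size(-2), scores.size(-1)
    mask = bias[:, :, key_len - query_len:key_len, :key_len]
    # positions outside the mask get a very large negative score before softmax
    return scores.masked_fill(~mask, torch.finfo(scores.dtype).min)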
Top GitHub Comments
Hi @finetuneanon and @kurumuz
I have pushed some changes in the PR that solve the problem with longer sequence lengths. Please give it a try when you get a chance.
Thanks, Reza
Hi @RezaYazdaniAminabadi, thanks for the quick fixes. I tested them and inference with fp32 works correctly now. Instantiating as fp16 is possible without errors as well now, but the results are still broken. Is it possible to look into this again? I understand that gpt-neo-1.3B was trained as bfloat16 and gpt-neo-2.7B as fp32; however, I have successfully been doing inference in fp16 on it through the Huggingface implementation for quite a while without encountering any nan or overflow issues (although I had to modify their local attention mechanism to avoid OOM errors).