GPT-Neo inference issues
Attempting to do inference using the new inference support for gpt-neo models, I encountered multiple issues. I used the following code on Google Colab for testing:
Setup:
!pip install transformers==4.6.1
!pip install git+https://github.com/microsoft/DeepSpeed # d2cf66a66847aa7f8d25da5708b7016e54f29e0a
!pip install mpi4py
!rm /usr/local/cuda
!ln -s /usr/local/cuda-10.1 /usr/local/cuda
Inference:
import deepspeed
from transformers import GPTNeoForCausalLM, AutoTokenizer
import torch
import time
#model_name = "EleutherAI/gpt-neo-2.7B"
model_name = "EleutherAI/gpt-neo-125M"
from transformers.file_utils import cached_path, WEIGHTS_NAME, hf_bucket_url
archive_file = hf_bucket_url(model_name, filename=WEIGHTS_NAME)
resolved_archive_file = cached_path(archive_file)
checkpoint = torch.load(resolved_archive_file, map_location="cuda:0")
for k in checkpoint.keys():
    checkpoint[k] = checkpoint[k].half()
model = GPTNeoForCausalLM.from_pretrained(model_name, state_dict=checkpoint).half().to("cuda").eval()
for k in list(checkpoint.keys()):
    del checkpoint[k]
del checkpoint
model = model.float()
tokenizer = AutoTokenizer.from_pretrained("gpt2")
torch.cuda.empty_cache()
ds_engine = deepspeed.init_inference(model,
                                     mp_size=1,
                                     dtype=torch.float,  #half,
                                     replace_method='auto')
model = ds_engine.module
ids = tokenizer.encode("A valley full of unicorns was discovered", return_tensors="pt").cuda()
outputs = model.generate(ids, use_cache=True, do_sample=True, min_length=1024, remove_invalid_values=True, max_length=1024, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0]))
None policy
Some issues occurred during loading. First issue:
[2021-05-27 10:10:05,407] [INFO] [logging.py:60:log_dist] [Rank -1] DeepSpeed info: version=0.4.0+d2cf66a, git-hash=d2cf66a, git-branch=master
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-2-80433536d61b> in <module>()
23 mp_size=1,
24 dtype=torch.float,#half,
---> 25 replace_method='auto')
26 model = ds_engine.module
27
4 frames
/usr/local/lib/python3.7/dist-packages/deepspeed/module_inject/replace_module.py in replace_module(model, orig_class, replace_fn, _replace_policy)
390 # instantiate a throw-away policy in order to populate the _orig_layer_class
391 _ = plcy(None)
--> 392 assert plcy._orig_layer_class != None
393 policy.update({plcy._orig_layer_class: (replace_fn, plcy)})
394
AssertionError:
I fixed this issue by changing lines 392-393:
if plcy._orig_layer_class is not None:
    policy.update({plcy._orig_layer_class: (replace_fn, plcy)})
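With that change applied, the relevant section of replace_module looks roughly like this (reconstructed from the traceback above, so the surrounding code is approximate):
# instantiate a throw-away policy in order to populate the _orig_layer_class
_ = plcy(None)
if plcy._orig_layer_class is not None:
    policy.update({plcy._orig_layer_class: (replace_fn, plcy)})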
Casting None to half precision
The next issue only occurs when loading the model with half precision:
[2021-05-27 10:12:47,942] [INFO] [logging.py:60:log_dist] [Rank -1] DeepSpeed info: version=0.4.0+d2cf66a, git-hash=d2cf66a, git-branch=master
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-1-fb492221c962> in <module>()
23 mp_size=1,
24 dtype=torch.half,
---> 25 replace_method='auto')
26 model = ds_engine.module
27
9 frames
/usr/local/lib/python3.7/dist-packages/deepspeed/module_inject/replace_module.py in replace_with_policy(child, policy_cls, inference, preln, layer_id)
156
157 if quantize or fp16:
--> 158 qkvb = qkvb.half()
159 dense_b = dense_b.half()
160 _h4h_b = _h4h_b.half()
AttributeError: 'NoneType' object has no attribute 'half'
This seems to be caused by policy.attention() returning None for qkvb. Either commenting the cast out or adding a check so it is only applied when qkvb is not None lets the model initialize, but generations look very broken.
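The guard I used looks roughly like this (variable names taken from the traceback above; only qkvb appears to be None for gpt-neo):
if quantize or fp16:
    # qkvb can be None for gpt-neo, so only cast it when it exists
    if qkvb is not None:
        qkvb = qkvb.half()
    dense_b = dense_b.half()
    _h4h_b = _h4h_b.half()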
Local attention not working correctly
Going back to fp32 inference, the model initializes correctly with the above modifications. However, after about 512 tokens (local attention window size) generation becomes very broken. The following is the output around the critical area:
print(tokenizer.decode(outputs[0, 480:580]))
, but they did this. They continued to have their own activities. They did not have any money – people didn’t go out to the mountain to pay their own way. They did not pay to the village, and it was a village which would not reach its goals.
Ital was that they did.
They did not.
The river did not go to the valley goes, they do. It
the river go, there
I have seen generations like this before, both while fixing a memory issue in the Huggingface local attention implementation and when loading gpt-neo models as gpt2, whenever something was wrong with the local attention mechanism. In gpt-neo, every second layer (or as otherwise defined by the config) uses local attention, where the causal mask is not a triangular mask but a sliding window of a given window size. Using the HF gpt2 implementation, this can be implemented with a bias matrix as follows:
max_positions = config.n_ctx
window_size = config.window_size
bias = torch.tril(torch.ones((max_positions, max_positions), dtype=torch.uint8)).view(1, 1, max_positions, max_positions).bool()
local_bias = bias ^ torch.tril(bias, -window_size)
for i in range(config.n_layer):
    if config.attention_layers[i] == "local":
        converted[f"transformer.h.{i}.attn.bias"] = local_bias
    elif config.attention_layers[i] == "global":
        converted[f"transformer.h.{i}.attn.bias"] = bias
Looking at the deepspeed code implementing gpt-neo, I found no handling of local attention layers and suspect they are treated as global attention, which leads to broken generations when the context becomes longer than the local attention window.
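For comparison, this is roughly how such a precomputed bias gets applied to the attention scores in the HF gpt2-style implementation (a simplified sketch with a hypothetical helper name, not DeepSpeed code); the local layers would need the banded local_bias here instead of the full causal bias:
import torch

def apply_attention_bias(scores, bias):
    # scores: (batch, heads, query_len, key_len) raw attention scores
    # bias: (1, 1, max_positions, max_positions) bool mask, either the global
    #       causal mask or the sliding-window local mask from above
    query_len, key_len = scores.size(-2), scores.size(-1)
    mask = bias[:, :, key_len - query_len:key_len, :key_len]
    # positions outside the mask get a very large negative score before softmax
    return scores.masked_fill(~mask, torch.finfo(scores.dtype).min)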
Top GitHub Comments
Hi @finetuneanon and @kurumuz
I have pushed some changes in the PR that solve the problem with longer sequence lengths. Please give it a try when you get a chance.
Thanks, Reza
Hi @RezaYazdaniAminabadi, thanks for the quick fixes. I tested them and inference with fp32 works correctly now. Instantiating as fp16 is possible without errors as well now, but the results are still broken. Is it possible to look into this again? I understand that gpt-neo-1.3B was trained as bfloat16 and gpt-neo-2.7B as fp32; however, I have successfully been doing inference in fp16 on it through the Huggingface implementation for quite a while without encountering any nan or overflow issues (although I had to modify their local attention mechanism to avoid OOM errors).