Irregular VRAM usage during gpt-neo inference with sequences longer than 250 tokens


Environment info

  • transformers version: 4.5.1 / HEAD
  • Platform: Linux/Colab Pro
  • Python version: 3.7
  • PyTorch version (GPU?): 1.8.1 (CUDA 11.0)
  • Tensorflow version (GPU?):
  • Using GPU in script?: Yes, NVIDIA P100
  • Using distributed or parallel set-up in script?:

Who can help

@patil-suraj

Information

Model I am using (Bert, XLNet …): EleutherAI/gpt-neo-2.7B

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. Install transformers in a Colab Pro notebook
  2. Run this script to log peak memory usage for inference with increasing sequence length: https://gist.github.com/finetuneanon/7ce0ed5090a27a383abffbbbc0433a29
  3. Wait for it to crash with an OOM error in the attention matmul somewhere above sequence length 1850

Output:

1870 5436434432
ok 6535669248
1871 5436434432
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-2-f2aeed4489bd> in <module>()
     21     return_dict_in_generate=True,
     22     repetition_penalty=1.2,
---> 23     pad_token_id=tokenizer.eos_token_id
     24   )
     25   del ids

13 frames
/usr/local/lib/python3.7/dist-packages/transformers/models/gpt_neo/modeling_gpt_neo.py in _attn(self, query, key, value, causal_mask, masked_bias, attn_dropout, attention_mask, head_mask)
    238         key = key.to(torch.float32)
    239 
--> 240         attn_weights = torch.matmul(query, key.transpose(-1, -2))
    241         attn_weights = torch.where(causal_mask, attn_weights, masked_bias.to(attn_weights.dtype))
    242 

RuntimeError: CUDA out of memory. Tried to allocate 4.59 GiB (GPU 0; 15.90 GiB total capacity; 9.75 GiB already allocated; 4.60 GiB free; 10.42 GiB reserved in total by PyTorch)

The full output can be found here: https://gist.github.com/finetuneanon/c7292ea676f57f5bb63803685d80bf5b

The output has the format:

sequence_length occupied_cuda_memory_before_inference
ok peak_occupied_cuda_memory_during_inference

Doing inference with real text has the same issue.
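
For reference, a minimal sketch of such a measurement loop (not the linked gist itself: the generation arguments come from the traceback above, while the fp16 model, the random-token inputs, the sequence-length range, and max_length are assumptions):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumed setup: GPT-Neo-2.7B in half precision (the ~5.4 GB baseline in the
# output above matches the fp16 weights of the 2.7B model).
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B").half().cuda().eval()

with torch.no_grad():
    for seq_len in range(250, 2048):  # range is an assumption; the reported crash happens above ~1850
        # Random token ids stand in for real text; the report notes real text behaves the same.
        ids = torch.randint(0, tokenizer.vocab_size, (1, seq_len), device="cuda")
        torch.cuda.reset_peak_memory_stats()
        print(seq_len, torch.cuda.memory_allocated())   # occupied memory before inference
        model.generate(
            ids,
            max_length=seq_len + 1,  # assumption: generate a single new token
            return_dict_in_generate=True,
            repetition_penalty=1.2,
            pad_token_id=tokenizer.eos_token_id,
        )
        print("ok", torch.cuda.max_memory_allocated())  # peak memory during inference
        del ids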

Expected behavior

I expected memory usage to increase steadily with sequence length instead of jumping around wildly, but I am not sure whether this is actually the correct behaviour. If it is, reliably doing inference on long sequences with 16 GB of VRAM seems to be impossible, even though it sometimes works.

I have also plotted the peak memory allocation during inference:

[Plot: peak memory allocation during inference vs. sequence length]

The green line is peak memory allocation, the brown line is the amount of memory in use before running inference.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 11 (4 by maintainers)

Top GitHub Comments

3 reactions
patil-suraj commented on Apr 21, 2021

Hi @finetuneanon

Thanks for the detailed issue!

So what is happening here is that the local attention is designed in a slightly weird way (the design, not the implementation): it splits the seq_length dim into (num_blocks, block_length), but block_length is actually dynamic.

It’s equal to window_size by default, which is 256, but when seq_length is not evenly divisible by block_length it is adjusted as follows

def _get_block_length_and_num_blocks(seq_length, window_size):
    """
    Computes ``block_length`` and ``num_blocks`` such that ``seq_length`` becomes evenly divisible by
    ``block_length``.
    """
    block_length = window_size
    while seq_length % block_length != 0:
        block_length -= 1
    num_blocks = seq_length // block_length
    return block_length, num_blocks

such that seq_length becomes evenly divisible by block_length.
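
In the worst case, a prime seq_length longer than the window drives block_length all the way down to 1, for example:

# (block_length, num_blocks) for window_size = 256
print(_get_block_length_and_num_blocks(256, 256))   # (256, 1)
print(_get_block_length_and_num_blocks(257, 256))   # (1, 257)  -- 257 is prime
print(_get_block_length_and_num_blocks(1871, 256))  # (1, 1871) -- so is 1871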

So the shape of query becomes (batch, num_blocks, block_length, hidden_dim) and then the keys and values are padded and the seq_length dim is split such that their shape becomes (batch, num_blocks, window_size + block_length, hidden_dim).

Here’s a simple function to get the shapes of query and key for a given seq_length

def get_query_key_shape(seq_len, window_size, hidden_dim):
    block_length, num_blocks = _get_block_length_and_num_blocks(seq_len, window_size)
    query_shape = (1, num_blocks, block_length, hidden_dim)
    key_shape = (1, num_blocks, window_size + block_length, hidden_dim)
    return query_shape, key_shape

Let’s print the shapes for a few lengths

window_size = 256
hidden_dim = 2560
for seq_len in range(256, 266):
    query_shape, key_shape = get_query_key_shape(seq_len, window_size, hidden_dim)
    print(f"seq_len: {seq_len}, query_shape: {query_shape}, key_shape: {key_shape}"

which gives

seq_len: 256, query_shape: (1, 1, 256, 2560), key_shape: (1, 1, 512, 2560)
seq_len: 257, query_shape: (1, 257, 1, 2560), key_shape: (1, 257, 257, 2560)
seq_len: 258, query_shape: (1, 2, 129, 2560), key_shape: (1, 2, 385, 2560)
seq_len: 259, query_shape: (1, 7, 37, 2560), key_shape: (1, 7, 293, 2560)
seq_len: 260, query_shape: (1, 2, 130, 2560), key_shape: (1, 2, 386, 2560)
seq_len: 261, query_shape: (1, 3, 87, 2560), key_shape: (1, 3, 343, 2560)
seq_len: 262, query_shape: (1, 2, 131, 2560), key_shape: (1, 2, 387, 2560)
seq_len: 263, query_shape: (1, 263, 1, 2560), key_shape: (1, 263, 257, 2560)
seq_len: 264, query_shape: (1, 2, 132, 2560), key_shape: (1, 2, 388, 2560)
seq_len: 265, query_shape: (1, 5, 53, 2560), key_shape: (1, 5, 309, 2560)

As you can see, because of the dynamic block_length the dimensions are very different for different values of seq_length, which explains the irregular VRAM usage.

If you set seq_length to 1871 you’ll get

seq_len: 1871, query_shape: (1, 1871, 1, 2560), key_shape: (1, 1871, 257, 2560)

as you posted above.
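
That key shape is also consistent with the size of the failed allocation: the keys are cast to float32 just before the matmul (line 238 in the traceback), and a float32 tensor of that key shape comes out to roughly the 4.59 GiB the matmul tried to allocate:

# Back-of-the-envelope check for seq_len = 1871 (float32, hidden_dim = 2560)
num_blocks, window_plus_block, hidden_dim = 1871, 257, 2560
print(num_blocks * window_plus_block * hidden_dim * 4 / 2**30)  # ~4.59 GiB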

So I wouldn’t say this is an implementation issue; that’s how the local attention algorithm is designed in mesh-tf.

1 reaction
patil-suraj commented on Apr 26, 2021

Great, I ran a small test and it seems to be working! (sorry about the earlier comment). Here’s the script

import torch
from torch import nn
from transformers.models.gpt_neo.modeling_gpt_neo import GPTNeoAttentionMixin
from transformers import GPTNeoConfig

class GPTNeoLocalSelfAttention(nn.Module, GPTNeoAttentionMixin):
    def __init__(self, config):
        super().__init__()

        self.register_buffer("masked_bias", torch.tensor(-1e9))

        self.attn_dropout = nn.Dropout(config.attention_dropout)
        self.resid_dropout = nn.Dropout(config.resid_dropout)

        self.embed_dim = config.hidden_size
        self.num_heads = config.num_heads
        self.head_dim = self.embed_dim // self.num_heads
        if self.head_dim * self.num_heads != self.embed_dim:
            raise ValueError(
                f"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim} and `num_heads`: {self.num_heads})."
            )

        self.k_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=False)
        self.v_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=False)
        self.q_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=False)
        self.out_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=True)

        self.window_size = config.window_size

    def forward(
        self,
        hidden_states,
        attention_mask=None,
        layer_past=None,
        head_mask=None,
        use_cache=False,
        output_attentions=False,
        pad_qkv=False
    ):
        query = self.q_proj(hidden_states)

        if layer_past is not None:
            past = layer_past[0]
            key_value_hidden_states = torch.cat([past, hidden_states], dim=1)
            past_length = past.size()[1]
        else:
            key_value_hidden_states = hidden_states
            past_length = 0

        key = self.k_proj(key_value_hidden_states)
        value = self.v_proj(key_value_hidden_states)
        
       
        # compute block length and num_blocks
        batch_size, seq_length = hidden_states.shape[:2]
        full_seq_length = seq_length + past_length
        
        padding = None
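        # Left-pad query/key/value (and the attention mask) so that full_seq_length
        # becomes a multiple of window_size; block_length then stays equal to
        # window_size instead of collapsing to a small divisor of full_seq_length.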
        if pad_qkv:
            if layer_past is None and full_seq_length % self.window_size != 0 and full_seq_length > self.window_size:
                padding = self.window_size-(full_seq_length%self.window_size)
                if attention_mask is None:
                    attention_mask = torch.zeros(query.shape[0], query.shape[1] + padding).to(query.device)
                    attention_mask[:, padding:] = 1
                else:
                    attention_mask = torch.cat([torch.zeros(attention_mask.shape[0], padding).to(attention_mask.device), attention_mask], axis=1)
                pad = lambda x: torch.cat([torch.zeros(x.shape[0],padding,x.shape[2]).to(x.device), x], axis=1)
                query, key, value = map(pad, (query, key, value))
                seq_length += padding
                full_seq_length += padding
        
        block_length, num_blocks = self._get_block_length_and_num_blocks(full_seq_length, self.window_size)
        
        # create buckets
        if layer_past is not None:
            # we just need 1 block with block_length 1 when caching is enabled
            query = self._split_seq_length_dim_to(query, 1, 1)
        else:
            query = self._split_seq_length_dim_to(query, num_blocks, block_length)

        key = self._look_back(key, block_length, self.window_size)
        value = self._look_back(value, block_length, self.window_size)

        # select key/value vectors only for the last block
        if layer_past is not None:
            key = key[:, -1:, ...]
            value = value[:, -1:, ...]

        query = self._split_heads(query, self.num_heads, self.head_dim)
        key = self._split_heads(key, self.num_heads, self.head_dim)
        value = self._split_heads(value, self.num_heads, self.head_dim)
        
        attention_mask = GPTNeoAttentionMixin.create_local_attention_mask(
            batch_size, full_seq_length, self.window_size, "cpu", attention_mask
        )

        if layer_past is not None:
            # only take the mask for the last block
            attention_mask = attention_mask[:, -1:, :, -1:, :]

        # attn
        attn_output, attn_weights = self._attn(
            query,
            key,
            value,
            causal_mask=attention_mask,
            masked_bias=self.masked_bias,
            attn_dropout=self.attn_dropout,
            head_mask=head_mask,
        )

        attn_output = self._merge_heads(attn_output, self.num_heads, self.head_dim)
        attn_output = attn_output.reshape(batch_size, seq_length, self.embed_dim)
        
        if padding is not None:
            attn_output = attn_output[:,padding:]
            attn_weights = attn_weights[:,padding:]

        attn_output = self.out_proj(attn_output)
        attn_output = self.resid_dropout(attn_output)

        outputs = (attn_output,)
        if output_attentions:
            outputs += (attn_weights,)

        return outputs  # a, (attentions)

config = GPTNeoConfig(hidden_size=16, num_heads=4)
attn_layer = GPTNeoLocalSelfAttention(config).eval()

matched = []
with torch.no_grad():
    for seq_len in range(1, 2049):
        hidden_states = torch.randn(1, seq_len, 16)
        out = attn_layer(hidden_states)[0]
        out_with_padding = attn_layer(hidden_states, pad_qkv=True)[0]
        matched.append(torch.allclose(out, out_with_padding, atol=1e-5))

all(matched)
# True

I will run a few tests with the actual model and will let you know. If it works, feel free to open a PR 😃
