Changing a single example for BLOOM 176-B affects forward pass for other examples in a batch
System Info
- transformers version: 4.21.2
- Platform: Linux-4.18.0-305.25.1.el8_4.x86_64-x86_64-with-glibc2.17
- Python version: 3.8.13
- Huggingface_hub version: 0.9.1
- PyTorch version (GPU?): 1.11.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: yes
Who can help?
@thomasw21, @younesbelkada This issue is for unexpected BLOOM outputs.
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
I wrote this script to get the conditional NLL of the labels given the context. I tried different batches in which only the first example changes and the rest of the examples stay fixed. However, after a certain point, changing the first example affects the NLL of the other examples.
This is not supposed to happen.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = "bigscience/bloom"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    max_memory={0: '0GIB', 1: '51GIB', 2: '51GIB', 3: '51GIB',
                4: '51GIB', 5: '51GIB', 6: '51GIB', 7: '51GIB'},
    torch_dtype=torch.bfloat16,
)
model.eval()
def compute_gen_loss(lm_logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    batch_size = labels.shape[0]

    # shift so that tokens < n predict token n
    shift_logits = lm_logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()

    loss_fct = torch.nn.CrossEntropyLoss(reduction="none")
    loss = loss_fct(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1)
    )
    loss = loss.reshape(batch_size, -1)

    # per-example mean NLL over the unmasked (label) positions
    loss = loss.sum(dim=-1) / (shift_labels != -100).sum(dim=-1)
    return loss
def pad_ids(arrays, padding, max_length=-1):
    if (max_length < 0):
        max_length = max(list(map(len, arrays)))
    # left-pad every sequence to max_length
    arrays = [[padding] * (max_length - len(array)) +
              array for array in arrays]
    return arrays
def forward(text: list, labels: list, conditional: bool = True):
    input_tokens = tokenizer(text).input_ids
    label_tokens = tokenizer(labels).input_ids

    input_ids = [x + y for (x, y) in zip(input_tokens, label_tokens)]
    attention_mask = [(len(x) + len(y)) * [1]
                      for (x, y) in zip(input_tokens, label_tokens)]
    if (conditional):
        # only score the label tokens; the context is masked out with -100
        labels = [[-100] * len(x) + y for (x, y)
                  in zip(input_tokens, label_tokens)]
    else:
        labels = input_ids

    pad = 3
    input_ids = pad_ids(input_ids, pad)
    attention_mask = pad_ids(attention_mask, 0)
    labels = pad_ids(labels, -100)

    input_ids = torch.tensor(input_ids)
    attention_mask = torch.tensor(attention_mask)
    labels = torch.tensor(labels)

    lm_logits = model(
        input_ids=input_ids,
        attention_mask=attention_mask
    ).logits

    # labels need to be on the output device before the loss is computed
    labels = labels.to(lm_logits.device)
    print(compute_gen_loss(lm_logits, labels).cpu().tolist())
text = [
    "DeepSpeed",
    "DeepSpeed is a",
    "DeepSpeed is a machine",
    "DeepSpeed is a machine learning framework",
]
labels = [
    " is awesome.",
    " good person.",
    " that can wipe out the planet.",
    " for generating memes.",
]
forward(text, labels)
labels[0] = " is awesome. really awesome"
forward(text, labels)
labels[0] = " is awesome. really awesome. Try it."
forward(text, labels)
labels[0] = " is awesome. really awesome. Try it. You'll be surprised"
forward(text, labels)
labels[0] = " is awesome. really awesome. Try it. You'll be surprised. BLOOM was trained using DeepSpeed."
forward(text, labels)
labels[0] = " is awesome. really awesome. Try it. You'll be surprised. BLOOM was trained using DeepSpeed. Oh no the values are bugging out now."
forward(text, labels)
[4.8125, 5.1875, 3.296875, 5.09375]
[5.625, 5.1875, 3.296875, 5.09375]
[4.375, 5.1875, 3.296875, 5.09375]
[4.0625, 5.1875, 3.28125, 5.09375]
[3.953125, 5.1875, 3.28125, 5.0625]
[4.25, 5.1875, 3.296875, 5.09375]
The value in column 2 drops from 3.296875 to 3.28125 when only the example in column 0 is changed, and column 3 also changes (5.09375 → 5.0625) in the fifth case. Only column 0 is supposed to change here.
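A quick extra check (not part of the original script, just an illustration) is to score each fixed (context, label) pair in a batch of one using the same forward helper; with a single example there is no padding, so if these numbers stay constant while the batched ones move, the drift comes from batching/padding rather than from the changed label text.

# hypothetical diagnostic: re-score the fixed examples individually
for t, l in zip(text[1:], labels[1:]):
    forward([t], [l])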
Expected behavior
[4.8125, 5.1875, 3.296875, 5.09375]
[5.625, 5.1875, 3.296875, 5.09375]
[4.375, 5.1875, 3.296875, 5.09375]
[4.0625, 5.1875, 3.296875, 5.09375]
[3.953125, 5.1875, 3.296875, 5.09375]
[4.25, 5.1875, 3.296875, 5.09375]
Top GitHub Comments
Okay, I think the gpt2 test isn't instability. Essentially it's the absolute positional embeddings that are screwing with you: as you increase the label size you add padding to the left and shift the real tokens to the right, which is why you see big shifts in the loss.
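For an absolute-position model like gpt2, one common way around this (a sketch under that assumption, not something from this thread) is to build position_ids from the attention mask so that left padding no longer shifts the real tokens:

import torch

# left-padded batch: 0 marks padding, 1 marks real tokens
attention_mask = torch.tensor([[0, 0, 1, 1, 1],
                               [1, 1, 1, 1, 1]])

# give real tokens positions 0, 1, 2, ... regardless of how much padding precedes them
position_ids = attention_mask.cumsum(dim=-1) - 1
position_ids = position_ids.clamp(min=0)

# then pass them explicitly, e.g.:
# logits = model(input_ids, attention_mask=attention_mask, position_ids=position_ids).logits

(BLOOM itself uses ALiBi rather than learned positions, so this only matters for the gpt2 comparison.)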
I do think that the bloom test is instability. Typically 3.28125 and 3.296875 are consecutive bfloat16 values. So, as you said, you can try computing the logits in fp32, which will increase precision (but will be slower). There's a bit of a workaround, as you need to cast the embedding layers to fp32 and such.
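To make the "consecutive" point concrete, here is a small self-contained check of the bf16 spacing in that range, plus a note on where an fp32 cast could go in the repro script (my illustration, not from the thread):

import torch

# bfloat16 keeps 8 significant bits, so between 2 and 4 the spacing is 2**-6 = 0.015625;
# 3.28125 and 3.296875 are therefore adjacent representable values.
lo = torch.tensor(3.28125, dtype=torch.bfloat16)
hi = torch.tensor(3.296875, dtype=torch.bfloat16)
print((hi - lo).item())  # 0.015625, exactly one bf16 step in this range

# An assumed, partial mitigation in the repro above: upcast the logits before the loss,
# e.g. use `.logits.float()` in forward(), so the NLL is at least reduced in fp32.
# The matmuls inside the model still run in bf16, so some noise may remain.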
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.