Changing a single example for BLOOM 176-B affects forward pass for other examples in a batch

System Info

  • transformers version: 4.21.2
  • Platform: Linux-4.18.0-305.25.1.el8_4.x86_64-x86_64-with-glibc2.17
  • Python version: 3.8.13
  • Huggingface_hub version: 0.9.1
  • PyTorch version (GPU?): 1.11.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes

Who can help?

@thomasw21, @younesbelkada. This issue is for unexpected BLOOM outputs.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

I wrote this script to get the conditional NLL of the labels given the context. I tried different batches in which only the first example changes while the rest of the examples stay fixed. However, past a certain point, changing the first example affects the NLL of the other examples in the batch.

This is not supposed to happen.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "bigscience/bloom"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    max_memory={0: '0GIB', 1: '51GIB', 2: '51GIB', 3: '51GIB',
                4: '51GIB', 5: '51GIB', 6: '51GIB', 7: '51GIB'},
    torch_dtype=torch.bfloat16,
)

model.eval()

def compute_gen_loss(lm_logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # per-example NLL averaged over the label tokens (positions set to -100 are ignored)
    batch_size = labels.shape[0]
    shift_logits = lm_logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()

    loss_fct = torch.nn.CrossEntropyLoss(reduction="none")
    loss = loss_fct(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1)
    )
    loss = loss.reshape(batch_size, -1)
    loss = loss.sum(dim=-1) / (shift_labels != -100).sum(dim=-1)
    return loss


def pad_ids(arrays, padding, max_length=-1):
    # left-pad every sequence with `padding` up to the longest length in the batch
    if (max_length < 0):
        max_length = max(list(map(len, arrays)))

    arrays = [[padding] * (max_length - len(array)) +
              array for array in arrays]

    return arrays


def forward(text: list, labels: list, conditional: bool = True):
    input_tokens = tokenizer(text).input_ids
    label_tokens = tokenizer(labels).input_ids

    input_ids = [x + y for (x, y) in zip(input_tokens, label_tokens)]
    attention_mask = [(len(x) + len(y)) * [1]
                      for (x, y) in zip(input_tokens, label_tokens)]
    if (conditional):
        labels = [[-100] * len(x) + y for (x, y)
                  in zip(input_tokens, label_tokens)]
    else:
        labels = input_ids

    pad = 3  # the BLOOM tokenizer's pad token id
    input_ids = pad_ids(input_ids, pad)
    attention_mask = pad_ids(attention_mask, 0)
    labels = pad_ids(labels, -100)

    input_ids = torch.tensor(input_ids)
    attention_mask = torch.tensor(attention_mask)
    labels = torch.tensor(labels)
    lm_logits = model(
        input_ids=input_ids,
        attention_mask=attention_mask
    ).logits

    # labels need to be on the logits' device for the loss computation
    print(compute_gen_loss(lm_logits, labels.to(lm_logits.device)).cpu().tolist())

text = [
    "DeepSpeed",
    "DeepSpeed is a",
    "DeepSpeed is a machine",
    "DeepSpeed is a machine learning framework",
]
labels = [
    " is awesome.",
    " good person.",
    " that can wipe out the planet.",
    " for generating memes.",
]
forward(text, labels)

labels[0] = " is awesome. really awesome"
forward(text, labels)

labels[0] = " is awesome. really awesome. Try it."
forward(text, labels)

labels[0] = " is awesome. really awesome. Try it. You'll be surprised"
forward(text, labels)

labels[0] = " is awesome. really awesome. Try it. You'll be surprised. BLOOM was trained using DeepSpeed."
forward(text, labels)

labels[0] = " is awesome. really awesome. Try it. You'll be surprised. BLOOM was trained using DeepSpeed. Oh no the values are bugging out now."
forward(text, labels)

Output:

[4.8125, 5.1875, 3.296875, 5.09375]
[5.625, 5.1875, 3.296875, 5.09375]
[4.375, 5.1875, 3.296875, 5.09375]
[4.0625, 5.1875, 3.28125, 5.09375]
[3.953125, 5.1875, 3.28125, 5.0625]
[4.25, 5.1875, 3.296875, 5.09375]

The value in column 2 drops from 3.296875 to 3.28125 when only the example in column 0 is changed, and in the fifth batch even column 3 changes. Only column 0 should change here.
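
As a side check (a small sketch reusing the tokenizer, text, and labels defined above): the batch is left-padded to its longest example, so growing labels[0] also increases the padded length that the unchanged rows are run with, which is presumably where the small low-precision differences sneak in.

labels[0] = " is awesome."
short_lengths = [len(x) + len(y) for x, y in zip(tokenizer(text).input_ids,
                                                 tokenizer(labels).input_ids)]
labels[0] = " is awesome. really awesome. Try it. You'll be surprised."
long_lengths = [len(x) + len(y) for x, y in zip(tokenizer(text).input_ids,
                                                tokenizer(labels).input_ids)]
# pad_ids pads every row to the batch max, so the unchanged rows get longer inputs too
print(max(short_lengths), max(long_lengths))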

Expected behavior

[4.8125, 5.1875, 3.296875, 5.09375]
[5.625, 5.1875, 3.296875, 5.09375]
[4.375, 5.1875, 3.296875, 5.09375]
[4.0625, 5.1875, 3.296875, 5.09375]
[3.953125, 5.1875, 3.296875, 5.09375]
[4.25, 5.1875, 3.296875, 5.09375]

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

2 reactions
thomasw21 commented, Sep 14, 2022

Okay, I think the gpt2 test isn't instability. Essentially it's the absolute positional embeddings that are messing with you: as you increase the label size, the real tokens shift to the right and padding is added on the left, which is why you see big shifts in the loss.
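
For a gpt2-style model (absolute positional embeddings), one common option to make the loss insensitive to the amount of left padding is to pass explicit position_ids derived from the attention mask, so the real tokens always start at position 0. A minimal sketch with a hypothetical left-padded batch (the token ids below are only illustrative):

import torch
from transformers import AutoModelForCausalLM

gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")
gpt2.eval()

input_ids = torch.tensor([[50256, 50256, 35, 4066, 318, 257]])  # two pad tokens on the left
attention_mask = torch.tensor([[0, 0, 1, 1, 1, 1]])

# positions count only the real tokens; pad positions get a dummy id of 1
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)

logits = gpt2(input_ids=input_ids,
              attention_mask=attention_mask,
              position_ids=position_ids).logits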

I do think the bloom test is instability, though. Notably, 3.28125 and 3.296875 are consecutive bfloat16 values.

>>> import numpy as np
>>> import torch
>>> torch.set_printoptions(precision=10)
>>> torch.frombuffer(bytes(np.array([83,64], np.int8)), dtype=torch.bfloat16)
tensor([3.2968750000], dtype=torch.bfloat16)
>>> torch.frombuffer(bytes(np.array([82,64], np.int8)), dtype=torch.bfloat16) # replace 83 with 82
tensor([3.2812500000], dtype=torch.bfloat16)

>>> torch.frombuffer(bytes(np.array([-94,64], np.int8)), dtype=torch.bfloat16)
tensor([5.0625000000], dtype=torch.bfloat16)
>>> torch.frombuffer(bytes(np.array([-93,64], np.int8)), dtype=torch.bfloat16)
tensor([5.0937500000], dtype=torch.bfloat16)

So, as you said, you can try computing the logits in fp32, which will increase precision (but will be slower). It's a bit of a workaround, as you need to cast the embedding layers to fp32 and such.
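
A rough sketch of that workaround (assuming BLOOM's lm_head is tied to the input word embeddings, which it is, and that an fp32 copy of that matrix fits in memory): run the bf16 transformer body as usual, then do the final projection in fp32 so the logits, and hence the per-example NLL, are computed at full precision.

import torch

@torch.no_grad()
def fp32_logits(model, input_ids, attention_mask):
    # bf16 forward through the transformer body only
    hidden = model.transformer(
        input_ids=input_ids,
        attention_mask=attention_mask,
    ).last_hidden_state
    # fp32 copy of the (tied) output embedding, moved next to the hidden states
    weight = model.lm_head.weight.float().to(hidden.device)
    # fp32 logits; for BLOOM-176B this matrix is ~14 GB in fp32, so this trades memory for precision
    return hidden.float() @ weight.T

# drop-in for the model(...).logits call in the reproduction script:
# lm_logits = fp32_logits(model, input_ids, attention_mask)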

0 reactions
github-actions[bot] commented, Oct 10, 2022

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
