Changing a single example for BLOOM 176-B affects forward pass for other examples in a batch
System Info
- transformers version: 4.21.2
- Platform: Linux-4.18.0-305.25.1.el8_4.x86_64-x86_64-with-glibc2.17
- Python version: 3.8.13
- Huggingface_hub version: 0.9.1
- PyTorch version (GPU?): 1.11.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: yes
Who can help?
@thomasw21, @younesbelkada This issue is for unexpected BLOOM outputs.
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
I wrote this script to get the conditional NLL of the labels given the context. I tried different batches in which only the first example changes and the rest of the examples stay fixed. However, after a certain point, changing the first example affects the NLL of the other examples.
This is not supposed to happen.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = "bigscience/bloom"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    max_memory={0: '0GIB', 1: '51GIB', 2: '51GIB', 3: '51GIB',
                4: '51GIB', 5: '51GIB', 6: '51GIB', 7: '51GIB'},
    torch_dtype=torch.bfloat16,
)
model.eval()
def compute_gen_loss(lm_logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    batch_size = labels.shape[0]

    # shift so that tokens < n predict token n
    shift_logits = lm_logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()

    loss_fct = torch.nn.CrossEntropyLoss(reduction="none")
    loss = loss_fct(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1)
    )
    loss = loss.reshape(batch_size, -1)

    # per-example mean NLL over the unmasked (label) positions
    loss = loss.sum(dim=-1) / (shift_labels != -100).sum(dim=-1)
    return loss
def pad_ids(arrays, padding, max_length=-1):
    if (max_length < 0):
        max_length = max(list(map(len, arrays)))
    # left-pad every sequence to max_length
    arrays = [[padding] * (max_length - len(array)) +
              array for array in arrays]
    return arrays
def forward(text: list, labels: list, conditional: bool = True):
    input_tokens = tokenizer(text).input_ids
    label_tokens = tokenizer(labels).input_ids

    input_ids = [x + y for (x, y) in zip(input_tokens, label_tokens)]
    attention_mask = [(len(x) + len(y)) * [1]
                      for (x, y) in zip(input_tokens, label_tokens)]
    if (conditional):
        # only score the label tokens; the context is masked out with -100
        labels = [[-100] * len(x) + y for (x, y)
                  in zip(input_tokens, label_tokens)]
    else:
        labels = input_ids

    pad = 3
    input_ids = pad_ids(input_ids, pad)
    attention_mask = pad_ids(attention_mask, 0)
    labels = pad_ids(labels, -100)

    input_ids = torch.tensor(input_ids)
    attention_mask = torch.tensor(attention_mask)
    labels = torch.tensor(labels)

    lm_logits = model(
        input_ids=input_ids,
        attention_mask=attention_mask
    ).logits

    # labels need to be on the output device before the loss is computed
    labels = labels.to(lm_logits.device)
    print(compute_gen_loss(lm_logits, labels).cpu().tolist())
text = [
    "DeepSpeed",
    "DeepSpeed is a",
    "DeepSpeed is a machine",
    "DeepSpeed is a machine learning framework",
]
labels = [
    " is awesome.",
    " good person.",
    " that can wipe out the planet.",
    " for generating memes.",
]
forward(text, labels)
labels[0] = " is awesome. really awesome"
forward(text, labels)
labels[0] = " is awesome. really awesome. Try it."
forward(text, labels)
labels[0] = " is awesome. really awesome. Try it. You'll be surprised"
forward(text, labels)
labels[0] = " is awesome. really awesome. Try it. You'll be surprised. BLOOM was trained using DeepSpeed."
forward(text, labels)
labels[0] = " is awesome. really awesome. Try it. You'll be surprised. BLOOM was trained using DeepSpeed. Oh no the values are bugging out now."
forward(text, labels)
[4.8125, 5.1875, 3.296875, 5.09375]
[5.625, 5.1875, 3.296875, 5.09375]
[4.375, 5.1875, 3.296875, 5.09375]
[4.0625, 5.1875, 3.28125, 5.09375]
[3.953125, 5.1875, 3.28125, 5.0625]
[4.25, 5.1875, 3.296875, 5.09375]
The value in column 2 drops from 3.296875 to 3.28125 when only the example in column 0 is changed, and column 3 also changes (5.09375 → 5.0625) in the fifth case. Only column 0 is supposed to change here.
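A quick extra check (not part of the original script, just an illustration) is to score each fixed (context, label) pair in a batch of one using the same forward helper; with a single example there is no padding, so if these numbers stay constant while the batched ones move, the drift comes from batching/padding rather than from the changed label text.

# hypothetical diagnostic: re-score the fixed examples individually
for t, l in zip(text[1:], labels[1:]):
    forward([t], [l])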
Expected behavior
[4.8125, 5.1875, 3.296875, 5.09375]
[5.625, 5.1875, 3.296875, 5.09375]
[4.375, 5.1875, 3.296875, 5.09375]
[4.0625, 5.1875, 3.296875, 5.09375]
[3.953125, 5.1875, 3.296875, 5.09375]
[4.25, 5.1875, 3.296875, 5.09375]
Top GitHub Comments
Okay, I think the gpt2 test isn't instability. Essentially it's the absolute positional embeddings that are screwing with you: as you increase the label size you add padding to the left and shift the real tokens to the right, which is why you see big shifts in the loss.
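For an absolute-position model like gpt2, one common way around this (a sketch under that assumption, not something from this thread) is to build position_ids from the attention mask so that left padding no longer shifts the real tokens:

import torch

# left-padded batch: 0 marks padding, 1 marks real tokens
attention_mask = torch.tensor([[0, 0, 1, 1, 1],
                               [1, 1, 1, 1, 1]])

# give real tokens positions 0, 1, 2, ... regardless of how much padding precedes them
position_ids = attention_mask.cumsum(dim=-1) - 1
position_ids = position_ids.clamp(min=0)

# then pass them explicitly, e.g.:
# logits = model(input_ids, attention_mask=attention_mask, position_ids=position_ids).logits

(BLOOM itself uses ALiBi rather than learned positions, so this only matters for the gpt2 comparison.)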
I do think that the bloom test is instability. Typically 3.28125 and 3.296875 are consecutive bfloat16 values. So, as you said, you can try computing the logits in fp32, which will increase precision (but will be slower). There's a bit of a workaround, as you need to cast the embedding layers to fp32 and such.
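To make the "consecutive" point concrete, here is a small self-contained check of the bf16 spacing in that range, plus a note on where an fp32 cast could go in the repro script (my illustration, not from the thread):

import torch

# bfloat16 keeps 8 significant bits, so between 2 and 4 the spacing is 2**-6 = 0.015625;
# 3.28125 and 3.296875 are therefore adjacent representable values.
lo = torch.tensor(3.28125, dtype=torch.bfloat16)
hi = torch.tensor(3.296875, dtype=torch.bfloat16)
print((hi - lo).item())  # 0.015625, exactly one bf16 step in this range

# An assumed, partial mitigation in the repro above: upcast the logits before the loss,
# e.g. use `.logits.float()` in forward(), so the NLL is at least reduced in fp32.
# The matmuls inside the model still run in bf16, so some noise may remain.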
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.