
Different batch sizes lead to different inference results

See original GitHub issue

Hi,

I found that when setting load_in_8bit=True, different batch sizes lead to very different inference results, even though I am only running the forward pass. I have observed this for several HF pretrained language models loaded in int8. A simple example is as follows, where out1 and out2 differ substantially.

Thank you!

GPU: 1 RTX3090, Driver version: 470.103.01
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 114
from transformers import GPT2Tokenizer, AutoModelForCausalLM
import torch

tokenizer = GPT2Tokenizer.from_pretrained("facebook/opt-1.3b")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b",
	device_map='auto', load_in_8bit=True)

#model.cuda()
model.eval()


@torch.no_grad()
def do_inference(model, input_ids, attention_mask):
  outputs = model(input_ids=input_ids.cuda(), attention_mask=attention_mask.cuda())
  return outputs.logits.cpu()


batch_sents = [
 'Review: luminous interviews and amazingly evocative film from three decades ago \nSentiment:',
 'Review: with fewer gags to break the tedium \nSentiment:',
 'Review: aims for poetry and ends up sounding like satire \nSentiment:',
 'Review: no way original \nSentiment:'
 ]
enc_inputs = tokenizer(batch_sents, return_tensors='pt', padding=True)


# run inference with batch_size = 2
out1 = []
for i in range(0, len(batch_sents), 2):
  out = do_inference(model, enc_inputs['input_ids'][i:i+2], enc_inputs['attention_mask'][i:i+2])
  out1.append(out)
out1 = torch.cat(out1)

# run inference with batch_size = 4
out2 = do_inference(model, enc_inputs['input_ids'], enc_inputs['attention_mask'])

print(torch.abs(out1-out2).max()) #got tensor(2.0664, dtype=torch.float16) on my machine
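
As a control, the same comparison can be repeated without int8 quantization. The sketch below is not part of the original report; it assumes the code above has already run (so do_inference, enc_inputs and batch_sents are defined) and simply reloads the checkpoint in plain fp16 via torch_dtype:

# Sketch: rerun the batch-size comparison without bitsandbytes int8, as a control.
model_fp16 = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b", device_map='auto', torch_dtype=torch.float16)
model_fp16.eval()

out1_fp16 = []
for i in range(0, len(batch_sents), 2):
  out1_fp16.append(do_inference(model_fp16,
                                enc_inputs['input_ids'][i:i+2],
                                enc_inputs['attention_mask'][i:i+2]))
out1_fp16 = torch.cat(out1_fp16)

out2_fp16 = do_inference(model_fp16, enc_inputs['input_ids'], enc_inputs['attention_mask'])
# Any difference here would come from fp16 kernel/accumulation effects rather than int8.
print(torch.abs(out1_fp16 - out2_fp16).max())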

Issue Analytics

  • State: open
  • Created: a year ago
  • Reactions: 1
  • Comments: 7 (1 by maintainers)

Top GitHub Comments

1 reaction
terarachang commented, Sep 6, 2022

Hi Tim,

Thank you so much for your reply.

I’d like to share my new findings. It seems that the results depend on which other instances end up in the same batch:

# To avoid any potential issues in attention_mask, truncate all sents to the same length
enc_inputs = tokenizer(batch_sents, return_tensors='pt', max_length=5, truncation=True)

# run inference with batch_size = 2
out1 = []
for i in range(0, len(batch_sents), 2):
  out = do_inference(model, enc_inputs['input_ids'][i:i+2], enc_inputs['attention_mask'][i:i+2])
  out1.append(out)
out1 = torch.cat(out1)

# run inference with batch_size = 2 in a shuffled order
shuffled_order = [0,3,1,2]
input_ids_shf, attn_mask_shf = enc_inputs['input_ids'][shuffled_order], enc_inputs['attention_mask'][shuffled_order]

out4 = []
for i in range(0, len(batch_sents), 2):
  out = do_inference(model, input_ids_shf[i:i+2], attn_mask_shf[i:i+2])
  out4.append(out)
out4 = torch.cat(out4)

# shuffle back
ret_order = [0,2,3,1] # argsort(shuffled_order)
out4 = out4[ret_order]

print(torch.abs(out1-out4).max()) 
# got 1.0781 when load_in_8bit=True with the default threshold
# got 0 when load_in_8bit=True and int8_threshold = 0
# got 0 when load_in_8bit=False
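
For anyone reproducing this, one way to set the threshold explicitly is roughly the sketch below. It assumes a transformers version that ships BitsAndBytesConfig; older releases exposed the threshold directly as a from_pretrained keyword instead, so the exact spelling depends on your version.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Sketch: reload the model with an explicit LLM.int8() outlier threshold.
# llm_int8_threshold=6.0 is the library default; 0.0 corresponds to the
# "int8_threshold = 0" run above, where the difference disappeared.
quant_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=0.0)
model_thr0 = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b", device_map='auto', quantization_config=quant_config)
model_thr0.eval()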

A clarifying question: when I set int8_threshold=0, is that equivalent to running the entire model in fp16 or in int8? My understanding is that hidden-state values above the threshold are considered outliers and their operations are done in fp16, so operation-wise int8_threshold=0 should be equivalent to running the entire model in fp16. Is that correct?
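
To make the question concrete, this is roughly how I picture the decomposition. The toy sketch below is purely illustrative (it is not the actual bitsandbytes kernel; the scaling details are simplified, and the non-quantized path is kept in float32 on CPU rather than fp16):

import torch

def toy_llm_int8_matmul(X, W, threshold=6.0):
    # Toy illustration of the LLM.int8() idea (not the real bitsandbytes kernel):
    # hidden-state columns whose max |value| exceeds `threshold` bypass quantization
    # ("outliers"); the remaining columns are absmax-quantized to int8.
    outlier = X.abs().amax(dim=0) > threshold            # one flag per hidden dimension
    X_out, W_out = X[:, outlier], W[outlier, :]          # non-quantized path
    X_in, W_in = X[:, ~outlier], W[~outlier, :]          # int8 path

    if (~outlier).any():
        # row-wise absmax scales for activations, column-wise for weights
        sx = X_in.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
        sw = W_in.abs().amax(dim=0, keepdim=True).clamp(min=1e-8) / 127.0
        Xq = torch.round(X_in / sx).to(torch.int8)
        Wq = torch.round(W_in / sw).to(torch.int8)
        int8_part = (Xq.long() @ Wq.long()).float() * (sx * sw)   # matmul + dequantize
    else:
        int8_part = torch.zeros(X.shape[0], W.shape[1])

    # the outlier path is fp16 in the real implementation; float32 here for a CPU-only toy
    return int8_part + X_out @ W_out

torch.manual_seed(0)
X = torch.randn(4, 16)
X[:, 3] *= 20                                            # plant one "outlier" feature column
W = torch.randn(16, 8)
print((toy_llm_int8_matmul(X, W) - X @ W).abs().max())                 # small quantization error
print((toy_llm_int8_matmul(X, W, threshold=0.0) - X @ W).abs().max())  # 0: nothing is quantized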

Thank you!

0 reactions
mallorbc commented, Nov 29, 2022

I do not know what the expected behavior is, since I have seen this occur even without int8. When I was doing batch processing for GPT-J, I used bfloat16, which is not as numerically unstable as fp16 can be. I have not tried this with fp32, but bfloat16 should be a drop-in replacement.
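
For reference, switching to bfloat16 is roughly a one-line change at load time. The sketch below reuses the OPT checkpoint from the report above (mallorbc was referring to GPT-J, but the same torch_dtype argument applies) and assumes an Ampere-class GPU such as the RTX 3090 mentioned earlier, which supports bfloat16:

from transformers import AutoModelForCausalLM
import torch

# Sketch: load the checkpoint in bfloat16 instead of int8/fp16.
# bfloat16 keeps fp32's exponent range, so it is less prone to overflow/underflow
# than fp16, at the cost of mantissa precision.
model_bf16 = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b", device_map='auto', torch_dtype=torch.bfloat16)
model_bf16.eval()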
