
Different batch sizes lead to different inference results

See original GitHub issue

Hi,

I found that when setting load_in_8bit=True, different batch sizes lead to very different inference results, even though I am only running the forward pass. I have observed this for several HF pretrained language models loaded in int8. A simple example is as follows, where out1 and out2 differ substantially.

Thank you!

GPU: 1 RTX3090, Driver version: 470.103.01
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 114
from transformers import GPT2Tokenizer, AutoModelForCausalLM
import torch

tokenizer = GPT2Tokenizer.from_pretrained("facebook/opt-1.3b")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b",
	device_map='auto', load_in_8bit=True)

#model.cuda()
model.eval()


@torch.no_grad()
def do_inference(model, input_ids, attention_mask):
  outputs = model(input_ids=input_ids.cuda(), attention_mask=attention_mask.cuda())
  return outputs.logits.cpu()


batch_sents = [
 'Review: luminous interviews and amazingly evocative film from three decades ago \nSentiment:',
 'Review: with fewer gags to break the tedium \nSentiment:',
 'Review: aims for poetry and ends up sounding like satire \nSentiment:',
 'Review: no way original \nSentiment:'
 ]
enc_inputs = tokenizer(batch_sents, return_tensors='pt', padding=True)


# run inference with batch_size = 2
out1 = []
for i in range(0, len(batch_sents), 2):
  out = do_inference(model, enc_inputs['input_ids'][i:i+2], enc_inputs['attention_mask'][i:i+2])
  out1.append(out)
out1 = torch.cat(out1)

# run inference with batch_size = 4
out2 = do_inference(model, enc_inputs['input_ids'], enc_inputs['attention_mask'])

print(torch.abs(out1-out2).max()) #got tensor(2.0664, dtype=torch.float16) on my machine
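
As a control, the same comparison can be repeated without int8 quantization. The sketch below is not part of the original report; it assumes the code above has already run (so do_inference, enc_inputs and batch_sents are defined) and simply reloads the checkpoint in plain fp16 via torch_dtype:

# Sketch: rerun the batch-size comparison without bitsandbytes int8, as a control.
model_fp16 = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b", device_map='auto', torch_dtype=torch.float16)
model_fp16.eval()

out1_fp16 = []
for i in range(0, len(batch_sents), 2):
  out1_fp16.append(do_inference(model_fp16,
                                enc_inputs['input_ids'][i:i+2],
                                enc_inputs['attention_mask'][i:i+2]))
out1_fp16 = torch.cat(out1_fp16)

out2_fp16 = do_inference(model_fp16, enc_inputs['input_ids'], enc_inputs['attention_mask'])
# Any difference here would come from fp16 kernel/accumulation effects rather than int8.
print(torch.abs(out1_fp16 - out2_fp16).max())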

Issue Analytics

  • State: open
  • Created: a year ago
  • Reactions: 1
  • Comments: 7 (1 by maintainers)

Top GitHub Comments

1 reaction
terarachang commented, Sep 6, 2022

Hi Tim,

Thank you so much for your reply.

I’d like to share my new findings. It seems that the results depend on which other instances end up in the same batch:

# To avoid any potential issues in attention_mask, truncate all sents to the same length
enc_inputs = tokenizer(batch_sents, return_tensors='pt', max_length=5, truncation=True)

# run inference with batch_size = 2
out1 = []
for i in range(0, len(batch_sents), 2):
  out = do_inference(model, enc_inputs['input_ids'][i:i+2], enc_inputs['attention_mask'][i:i+2])
  out1.append(out)
out1 = torch.cat(out1)

# run inference with batch_size = 2 in a shuffled order
shuffled_order = [0,3,1,2]
input_ids_shf, attn_mask_shf = enc_inputs['input_ids'][shuffled_order], enc_inputs['attention_mask'][shuffled_order]

out4 = []
for i in range(0, len(batch_sents), 2):
  out = do_inference(model, input_ids_shf[i:i+2], attn_mask_shf[i:i+2])
  out4.append(out)
out4 = torch.cat(out4)

# shuffle back
ret_order = [0,2,3,1] # argsort(shuffled_order)
out4 = out4[ret_order]

print(torch.abs(out1-out4).max()) 
# got 1.0781 when load_in_8bit=True with the default threshold
# got 0 when load_in_8bit=True and int8_threshold = 0
# got 0 when load_in_8bit=False
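
For anyone reproducing this, one way to set the threshold explicitly is roughly the sketch below. It assumes a transformers version that ships BitsAndBytesConfig; older releases exposed the threshold directly as a from_pretrained keyword instead, so the exact spelling depends on your version.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Sketch: reload the model with an explicit LLM.int8() outlier threshold.
# llm_int8_threshold=6.0 is the library default; 0.0 corresponds to the
# "int8_threshold = 0" run above, where the difference disappeared.
quant_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=0.0)
model_thr0 = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b", device_map='auto', quantization_config=quant_config)
model_thr0.eval()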

A clarifying question: when I set int8_threshold=0, is that equivalent to running the entire model in fp16 or in int8? My understanding is that hidden-state values above the threshold are considered outliers and their operations are done in fp16, so operation-wise int8_threshold=0 should be equivalent to running the entire model in fp16. Is that correct?
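
To make the question concrete, this is roughly how I picture the decomposition. The toy sketch below is purely illustrative (it is not the actual bitsandbytes kernel; the scaling details are simplified, and the non-quantized path is kept in float32 on CPU rather than fp16):

import torch

def toy_llm_int8_matmul(X, W, threshold=6.0):
    # Toy illustration of the LLM.int8() idea (not the real bitsandbytes kernel):
    # hidden-state columns whose max |value| exceeds `threshold` bypass quantization
    # ("outliers"); the remaining columns are absmax-quantized to int8.
    outlier = X.abs().amax(dim=0) > threshold            # one flag per hidden dimension
    X_out, W_out = X[:, outlier], W[outlier, :]          # non-quantized path
    X_in, W_in = X[:, ~outlier], W[~outlier, :]          # int8 path

    if (~outlier).any():
        # row-wise absmax scales for activations, column-wise for weights
        sx = X_in.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
        sw = W_in.abs().amax(dim=0, keepdim=True).clamp(min=1e-8) / 127.0
        Xq = torch.round(X_in / sx).to(torch.int8)
        Wq = torch.round(W_in / sw).to(torch.int8)
        int8_part = (Xq.long() @ Wq.long()).float() * (sx * sw)   # matmul + dequantize
    else:
        int8_part = torch.zeros(X.shape[0], W.shape[1])

    # the outlier path is fp16 in the real implementation; float32 here for a CPU-only toy
    return int8_part + X_out @ W_out

torch.manual_seed(0)
X = torch.randn(4, 16)
X[:, 3] *= 20                                            # plant one "outlier" feature column
W = torch.randn(16, 8)
print((toy_llm_int8_matmul(X, W) - X @ W).abs().max())                 # small quantization error
print((toy_llm_int8_matmul(X, W, threshold=0.0) - X @ W).abs().max())  # 0: nothing is quantized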

Thank you!

0 reactions
mallorbc commented, Nov 29, 2022

I do not know what the expected behavior is, since I have seen this occur even without int8. When I was doing batch processing for GPT-J, I used bfloat16, which is not as numerically unstable as fp16 can be. I have not tried this with fp32, but bfloat16 should be a drop-in replacement.
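
For reference, switching to bfloat16 is roughly a one-line change at load time. The sketch below reuses the OPT checkpoint from the report above (mallorbc was referring to GPT-J, but the same torch_dtype argument applies) and assumes an Ampere-class GPU such as the RTX 3090 mentioned earlier, which supports bfloat16:

from transformers import AutoModelForCausalLM
import torch

# Sketch: load the checkpoint in bfloat16 instead of int8/fp16.
# bfloat16 keeps fp32's exponent range, so it is less prone to overflow/underflow
# than fp16, at the cost of mantissa precision.
model_bf16 = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b", device_map='auto', torch_dtype=torch.bfloat16)
model_bf16.eval()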
