Different batch sizes lead to different inference results
Hi,
I found that when setting load_in_8bit=True, different batch sizes lead to very different results, even though I'm doing inference only. I have seen this with several HF pretrained language models in int8.
A simple example follows, where I got very different results when comparing out1 and out2.
Thank you!
GPU: 1 RTX3090, Driver version: 470.103.01
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 114
from transformers import GPT2Tokenizer, AutoModelForCausalLM
import torch

tokenizer = GPT2Tokenizer.from_pretrained("facebook/opt-1.3b")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b",
                                             device_map='auto', load_in_8bit=True)
#model.cuda()
model.eval()

@torch.no_grad()
def do_inference(model, input_ids, attention_mask):
    outputs = model(input_ids=input_ids.cuda(), attention_mask=attention_mask.cuda())
    return outputs.logits.cpu()

batch_sents = [
    'Review: luminous interviews and amazingly evocative film from three decades ago \nSentiment:',
    'Review: with fewer gags to break the tedium \nSentiment:',
    'Review: aims for poetry and ends up sounding like satire \nSentiment:',
    'Review: no way original \nSentiment:'
]
# tokenize all four sentences together so both runs see identical padding
enc_inputs = tokenizer(batch_sents, return_tensors='pt', padding=True)

# run inference with batch_size = 2
out1 = []
for i in range(0, len(batch_sents), 2):
    out = do_inference(model, enc_inputs['input_ids'][i:i+2], enc_inputs['attention_mask'][i:i+2])
    out1.append(out)
out1 = torch.cat(out1)

# run inference with batch_size = 4
out2 = do_inference(model, enc_inputs['input_ids'], enc_inputs['attention_mask'])

print(torch.abs(out1-out2).max())  # got tensor(2.0664, dtype=torch.float16) on my machine
Top GitHub Comments
Hi Tim,
Thank you so much for your reply.
I’d like to share my new findings. It seems like the results are dependent on the instances in the same batch:
A clarifying question: when I set int8_threshold=0, is that equivalent to running the entire model in fp16 or in int8? My understanding is: hidden state values that are above this threshold are considered outliers and their operations will be done in fp16, so operation-wise, int8_threshold=0 is equivalent to running the entire model in fp16, correct?
Thank you!
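One possible reason the int8 output for a sentence can depend on the other instances in its batch: in the mixed-precision decomposition described in the LLM.int8() paper, a feature dimension counts as an outlier when any row of the hidden-state matrix exceeds the threshold there, so the split between the high-precision path and the int8 path is decided over the whole batch. The following is a minimal pure-Python sketch of that idea, not the bitsandbytes implementation. The names `absmax_quantize` and `mixed_matmul` are hypothetical, the quantization is a simplified absmax scheme, and plain Python floats stand in for fp16. In this sketch a threshold of 0 sends every nonzero dimension down the high-precision path; a real library may special-case 0 (for example, to switch the decomposition off), so that is worth checking against its source.

```python
def absmax_quantize(values):
    """Symmetric int8 quantization: returns (list of ints, scale)."""
    peak = max((abs(v) for v in values), default=0.0)
    scale = peak / 127.0 if peak > 0 else 1.0
    return [round(v / scale) for v in values], scale


def mixed_matmul(X, W, threshold):
    """Compute X @ W with outlier columns of X kept in full precision.

    X is a batch of rows (batch x d), W is d x n, both as lists of lists.
    """
    d, n = len(W), len(W[0])
    # A feature dimension is an outlier if ANY row in the batch exceeds
    # the threshold there, so this set depends on the whole batch.
    outlier = [j for j in range(d) if any(abs(row[j]) > threshold for row in X)]
    regular = [j for j in range(d) if j not in outlier]

    # Quantize each output column of W over the regular dimensions only.
    w_q = [absmax_quantize([W[j][k] for j in regular]) for k in range(n)]

    out = []
    for row in X:
        x_q, x_scale = absmax_quantize([row[j] for j in regular])
        result = []
        for k in range(n):
            w_ints, w_scale = w_q[k]
            # int8 path: integer dot product, then dequantize
            int_part = sum(p * q for p, q in zip(x_q, w_ints)) * x_scale * w_scale
            # high-precision path: outlier dimensions, no quantization
            float_part = sum(row[j] * W[j][k] for j in outlier)
            result.append(int_part + float_part)
        out.append(result)
    return out


W = [[0.2, -0.5], [0.7, 0.1], [-0.3, 0.4], [0.6, -0.8]]
a = [0.5, -1.2, 0.3, 0.9]   # no value above the threshold
b = [0.1, 8.0, -0.2, 0.4]   # dimension 1 exceeds the threshold

alone = mixed_matmul([a], W, threshold=6.0)[0]
batched = mixed_matmul([a, b], W, threshold=6.0)[0]
print(alone, batched)  # same input row, slightly different outputs
```

Row `a` is identical in both calls, but adding `b` to the batch moves dimension 1 onto the high-precision path for every row, which changes the quantization error that `a` sees.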
I do not know what the expected behavior is after seeing this occur without using int8. When I was doing batch processing for GPT-J, I was using bfloat16, which is not unstable the way fp16 can be. I have not tried this with fp32, but bfloat16 should be a drop-in replacement.
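As a rough illustration of the fp16 behavior mentioned here, the sketch below simulates half-precision rounding with Python's `struct` module. It shows two effects: the result of an fp16 accumulation depends on the order of the additions, and fp16 overflows at magnitudes that bfloat16 can still represent (with coarser precision). The helpers `to_fp16`, `to_bf16`, and `rounded_sum` are hypothetical names written for this example, `to_bf16` assumes finite inputs, and whether a given GPU kernel actually changes its accumulation order with batch size is kernel-specific and not something this sketch establishes.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses values outside the fp16 range; treat them as inf
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x):
    """Round a finite float to bfloat16 (top 16 bits of float32)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # round to nearest, ties to even, on the 16 bits being dropped
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def rounded_sum(values, rnd):
    """Accumulate left to right, rounding after every addition."""
    total = 0.0
    for v in values:
        total = rnd(total + v)
    return total


values = [2048.0] + [1.0] * 8
forward = rounded_sum(values, to_fp16)
backward = rounded_sum(list(reversed(values)), to_fp16)
print(forward, backward)   # 2048.0 2056.0: same numbers, different order

print(to_fp16(70000.0))    # inf: beyond the fp16 maximum of 65504
print(to_bf16(70000.0))    # 70144.0: in range, but coarsely rounded
```

The wider exponent range is what makes bfloat16 less prone to overflow than fp16; it has fewer mantissa bits, so it is not more precise.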