Different GPT-2 outputs with mixed precision vs single precision
When using GPT-2 with mixed precision, the generated text is different from that produced by running it normally. This is true for both conditional and unconditional generation, and for top_k=1 (deterministic) and top_k=40. Typically the mixed precision and single precision outputs agree for a number of tokens and then begin to disagree (sometimes early, sometimes late).
Using GPT-2 with mixed precision would be useful to take advantage of the tensor cores on V100 and T4 GPUs.
Testing by calling model.half() on GPT2LMHeadModel tends to start producing incorrect outputs early, while using Apex’s AMP instead usually produces correct outputs for somewhat longer but still generally deviates. My tests were on the 117M model, with Apex installed.
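A minimal sketch of this kind of comparison (not my exact test script), assuming a recent transformers release (older releases pass the cache as past instead of past_key_values) and a CUDA GPU; the prompt and helper names are only illustrative:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
input_ids = tokenizer.encode("The tensor cores on a V100", return_tensors="pt").cuda()

def greedy_generate(model, input_ids, steps=50):
    """Greedy (top_k=1) decoding using the cached 'past' keys/values."""
    ids = input_ids.clone()
    past = None
    with torch.no_grad():
        for _ in range(steps):
            # After the first step, only the newest token is fed in; all earlier
            # tokens are represented by the cached past keys/values.
            out = model(ids if past is None else ids[:, -1:],
                        past_key_values=past, use_cache=True)
            past = out.past_key_values
            next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
            ids = torch.cat([ids, next_id], dim=-1)
    return ids[0].tolist()

model_fp32 = GPT2LMHeadModel.from_pretrained("gpt2").cuda().eval()
tokens_fp32 = greedy_generate(model_fp32, input_ids)

model_fp16 = GPT2LMHeadModel.from_pretrained("gpt2").cuda().eval().half()
tokens_fp16 = greedy_generate(model_fp16, input_ids)

# Report the first token position where the two precisions disagree.
first_diff = next((i for i, (a, b) in enumerate(zip(tokens_fp32, tokens_fp16)) if a != b), None)
print("first divergence at token index:", first_diff)
print("fp32:", tokenizer.decode(tokens_fp32))
print("fp16:", tokenizer.decode(tokens_fp16))
```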
It surprises me that the top_k=1 results often differ, sometimes very early in the sequence. With top_k=1 only the largest logit is taken, so this means the ranking of the logits itself is different between the two precisions.
I think the cause is compounding error in the “past” tensor used by the attention function. Each time a new token is generated, its past contains some error. When subsequent token generations then use those values (in higher attention layers), their own pasts accumulate more error, and so on, up through 12 layers for 117M or 24 for 345M. In cases where the top two logit values are almost the same, those 12 layers of accumulated error might be enough to change which one is larger and thereby change even the top_k=1 output. I haven’t verified this idea yet.
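A rough sketch of how this could be checked, again assuming a recent transformers release: run the same fixed token sequence through fp32 and fp16 copies of the 117M model, measure how far the per-layer activations (which produce the cached keys and values) drift apart, and look at how close the top two fp32 logits are at the final position (the prompt here is illustrative):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
ids = tokenizer.encode("The ranking of the logits", return_tensors="pt").cuda()

m32 = GPT2LMHeadModel.from_pretrained("gpt2").cuda().eval()
m16 = GPT2LMHeadModel.from_pretrained("gpt2").cuda().eval().half()

with torch.no_grad():
    out32 = m32(ids, output_hidden_states=True)
    out16 = m16(ids, output_hidden_states=True)

# Drift between fp32 and fp16 activations, layer by layer. If the compounding
# idea is right, the maximum difference should tend to grow with depth.
for layer, (h32, h16) in enumerate(zip(out32.hidden_states, out16.hidden_states)):
    err = (h32 - h16.float()).abs().max().item()
    print(f"layer {layer:2d}: max abs difference = {err:.3e}")

# If the top two fp32 logits are nearly tied, a small fp16 error can flip the
# argmax and change even the top_k=1 output.
top2 = out32.logits[0, -1].topk(2).values
print("fp32 top-2 logit gap:", (top2[0] - top2[1]).item())
print("greedy choice agrees:",
      out32.logits[0, -1].argmax().item() == out16.logits[0, -1].argmax().item())
```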
I’m not sure if this necessarily means the outputs will be qualitatively worse, but that’s a hard thing to measure.
@Damiox While sampling with mixed precision gives different results, they seem to still be of high quality. I’ve been using mixed precision on talktotransformer.com for at least 6-7 months now and the quality has been excellent.
Currently generation only allows batch_size=1.