Different GPT-2 outputs with mixed precision vs single precision
When using GPT-2 with mixed precision, the generated text is different from that produced by running it normally. This is true for both conditional and unconditional generation, and for top_k=1 (deterministic) and top_k=40. Typically the mixed precision and single precision outputs agree for a number of tokens and then begin to disagree (sometimes early, sometimes late).
Using GPT-2 with mixed precision would be useful to take advantage of the tensor cores on V100 and T4 GPUs.
Testing by calling model.half() on GPT2LMHeadModel tends to start producing incorrect outputs early, while using Apex’s AMP instead usually produces correct outputs for somewhat longer but still generally deviates. My tests were on the 117M model, with Apex installed.
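A minimal sketch of this kind of comparison (not my exact test script), assuming a recent transformers release (older releases pass the cache as past instead of past_key_values) and a CUDA GPU; the prompt and helper names are only illustrative:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
input_ids = tokenizer.encode("The tensor cores on a V100", return_tensors="pt").cuda()

def greedy_generate(model, input_ids, steps=50):
    """Greedy (top_k=1) decoding using the cached 'past' keys/values."""
    ids = input_ids.clone()
    past = None
    with torch.no_grad():
        for _ in range(steps):
            # After the first step, only the newest token is fed in; all earlier
            # tokens are represented by the cached past keys/values.
            out = model(ids if past is None else ids[:, -1:],
                        past_key_values=past, use_cache=True)
            past = out.past_key_values
            next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
            ids = torch.cat([ids, next_id], dim=-1)
    return ids[0].tolist()

model_fp32 = GPT2LMHeadModel.from_pretrained("gpt2").cuda().eval()
tokens_fp32 = greedy_generate(model_fp32, input_ids)

model_fp16 = GPT2LMHeadModel.from_pretrained("gpt2").cuda().eval().half()
tokens_fp16 = greedy_generate(model_fp16, input_ids)

# Report the first token position where the two precisions disagree.
first_diff = next((i for i, (a, b) in enumerate(zip(tokens_fp32, tokens_fp16)) if a != b), None)
print("first divergence at token index:", first_diff)
print("fp32:", tokenizer.decode(tokens_fp32))
print("fp16:", tokenizer.decode(tokens_fp16))
```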
It surprises me that the top_k=1 results often differ, sometimes very early in the sequence. With top_k=1 only the largest logit is taken, so this means the ranking of the logits itself is different between the two precisions.
I think the cause is compounding error in the “past” tensor used by the attention function. Each time a new token is generated, its past contains some error. When subsequent token generations then use those values (in higher attention layers), their own pasts accumulate more error, and so on, up through 12 layers for 117M or 24 for 345M. In cases where the top two logit values are almost the same, those 12 layers of accumulated error might be enough to change which one is larger and thereby change even the top_k=1 output. I haven’t verified this idea yet.
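A rough sketch of how this could be checked, again assuming a recent transformers release: run the same fixed token sequence through fp32 and fp16 copies of the 117M model, measure how far the per-layer activations (which produce the cached keys and values) drift apart, and look at how close the top two fp32 logits are at the final position (the prompt here is illustrative):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
ids = tokenizer.encode("The ranking of the logits", return_tensors="pt").cuda()

m32 = GPT2LMHeadModel.from_pretrained("gpt2").cuda().eval()
m16 = GPT2LMHeadModel.from_pretrained("gpt2").cuda().eval().half()

with torch.no_grad():
    out32 = m32(ids, output_hidden_states=True)
    out16 = m16(ids, output_hidden_states=True)

# Drift between fp32 and fp16 activations, layer by layer. If the compounding
# idea is right, the maximum difference should tend to grow with depth.
for layer, (h32, h16) in enumerate(zip(out32.hidden_states, out16.hidden_states)):
    err = (h32 - h16.float()).abs().max().item()
    print(f"layer {layer:2d}: max abs difference = {err:.3e}")

# If the top two fp32 logits are nearly tied, a small fp16 error can flip the
# argmax and change even the top_k=1 output.
top2 = out32.logits[0, -1].topk(2).values
print("fp32 top-2 logit gap:", (top2[0] - top2[1]).item())
print("greedy choice agrees:",
      out32.logits[0, -1].argmax().item() == out16.logits[0, -1].argmax().item())
```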
I’m not sure if this necessarily means the outputs will be qualitatively worse, but that’s a hard thing to measure.
@Damiox While sampling with mixed precision gives different results, they seem to still be of high quality. I’ve been using mixed precision on talktotransformer.com for at least 6-7 months now and the quality has been excellent.
Currently generation only allows batch_size=1.