[TODO] Investigate equivalence tests
(I added a lot of assignees just to keep you informed and updated in the future. Don't hesitate to remove yourself if you think it's irrelevant.)
Currently the PT/TF/Flax equivalence tests use 1e-5 as the tolerance for the absolute differences of outputs. We see that these tests fail with a non-negligible (although not carefully measured) frequency. This issue tracks a list of models to investigate.
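For context, the core of such an equivalence check can be sketched as follows. This is a minimal illustration with simulated arrays, not the actual test code; `check_equivalence` is a hypothetical helper:

```python
# Sketch of the PT/TF equivalence check described above: compare two
# framework outputs element-wise and fail if the largest absolute
# difference exceeds the tolerance. Illustrative only.
import numpy as np

def check_equivalence(pt_output, tf_output, tol=1e-5):
    """Return (passed, max_abs_diff) for two framework outputs."""
    pt = np.asarray(pt_output, dtype=np.float64)
    tf = np.asarray(tf_output, dtype=np.float64)
    max_diff = float(np.max(np.abs(pt - tf)))
    return max_diff <= tol, max_diff

# Two nearly identical float32 outputs with a small injected drift:
rng = np.random.default_rng(0)
a = rng.standard_normal((2, 8)).astype(np.float32)
b = (a.astype(np.float64) + 2e-5).astype(np.float32)  # ~2e-5 drift

passed, diff = check_equivalence(a, b)
# diff lands around 2e-5, so the 1e-5 tolerance fails - the same shape
# of failure as the reports below.
```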
- FlaxWav2Vec2ModelTest (2.2888184e-05 > 1e-5)
- TFGPT2EncoderDecoderModelTest (0.001009281724691391 > 1e-3)
- https://app.circleci.com/pipelines/github/huggingface/transformers/37358/workflows/43c12161-33d8-4df5-ba3c-3e62a4507ee7/jobs/411579
- This also happens to TFBERTEncoderDecoderModelTest
- This is caused by a sequence in the batch that gets an all-zero attention mask (generated by ids_tensor) - this may happen on both the encoder and the decoder (especially after combining with the causal mask).
- For TFBERTEncoderDecoderModelTest, the difference is smaller than for TFGPT2EncoderDecoderModelTest (by a factor of 5x~10x) -> this is because the last hidden states in GPT2 come after a layer norm (which is not the case for BERT).
- If we look at the cross-attention diff between PT/TF, it is clear that we have the same issue (both are in the magnitude of 1e-3).
- The encoder attention diff between PT/TF is in the magnitude of 5e-8: not very sure why this one doesn't get much larger.
  - This is because PT/TF (at least in BERT) have different encoder_extended_attention_mask values: 1e-4 vs 1e-9.
- TFViTMAEModelTest (1.013279e-05 > 1e-5)
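The all-zero-mask failure mode described for the encoder-decoder tests above can be reproduced in isolation. This is a sketch with toy shapes; the -1e4/-1e9 fill values are illustrative stand-ins for the differing PT/TF extended-attention-mask constants, not the exact values in the library:

```python
# When a sequence's attention mask is all zeros, every attention score
# receives the large negative mask fill before the softmax. In float32,
# subtracting 1e4 keeps the scores (ulp ~1e-3 at that magnitude), while
# subtracting 1e9 rounds them away entirely (ulp = 64), giving a uniform
# softmax - so the two frameworks attend differently on that row.
import numpy as np

def softmax64(x):
    x = np.asarray(x, dtype=np.float64)
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.standard_normal(5).astype(np.float32)  # raw attention scores, O(1)
mask = np.zeros(5, dtype=np.float32)                # all-zero attention mask

# BERT-style additive mask: (1 - mask) * fill, added to scores in float32.
masked_a = scores + (np.float32(1.0) - mask) * np.float32(-1e4)
masked_b = scores + (np.float32(1.0) - mask) * np.float32(-1e9)

probs_a = softmax64(masked_a)  # scores survive float32 rounding at -1e4
probs_b = softmax64(masked_b)  # scores rounded away at -1e9 -> exactly uniform

max_diff = float(np.max(np.abs(probs_a - probs_b)))
# max_diff is on the order of 1e-1 here - far above any 1e-5 tolerance.
```

With a normal (partly unmasked) row, the unmasked scores dominate and both fill values give nearly identical probabilities; the divergence only appears when everything is masked, which matches it showing up only for some randomly generated batches.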
Issue Analytics
- Created a year ago
- Reactions: 1
- Comments: 5 (2 by maintainers)
(just for the record) Among 500 runs, FunnelForMaskedLM.output.logits shows a max diff at around 1e-5 ~ 2e-5 in ~6.8% of runs: so a ~6.8% chance of failure 😢 - and a diff of 8e-6 ~ 9e-6 occurs even more often (so > 25% chance of getting close to 1e-5).

Another one to add to this list: tests/funnel/test_modeling_funnel.py::FunnelModelTest::test_pt_tf_model_equivalence. I've been getting a failure in this one every other day - example: https://app.circleci.com/pipelines/github/huggingface/transformers/38007/workflows/2a98b7b1-5ad0-4b80-a702-1887c620193f/jobs/421265
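The failure-rate bookkeeping behind "among 500 runs" numbers like the above can be sketched like this. The noise model is simulated (Gaussian at float32-rounding scale) and `estimate_flakiness` is an assumed helper, not part of the test suite:

```python
# Repeat a randomized PT/TF-style comparison many times and report what
# fraction of max-abs-diffs lands above (or near) the tolerance. The
# simulated noise stands in for real model-output differences.
import numpy as np

def estimate_flakiness(n_runs=500, tol=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    diffs = []
    for _ in range(n_runs):
        # Stand-in for one run's "max abs diff between PT and TF logits".
        noise = rng.normal(0.0, 2.4e-6, size=(4, 7, 99))
        diffs.append(float(np.max(np.abs(noise))))
    diffs = np.asarray(diffs)
    return {
        "failure_rate": float(np.mean(diffs > tol)),          # diff > 1e-5
        "near_miss_rate": float(np.mean(diffs > 0.8 * tol)),  # diff > 8e-6
        "max_seen": float(diffs.max()),
    }

stats = estimate_flakiness()
```

Looping the real test this way is what turns "it failed again today" into a concrete flakiness estimate, which in turn suggests whether to fix the model or loosen the tolerance.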