[TODO] Investigate equivalence tests
(I added a lot of assignees just to keep you informed and updated in the future. Don't hesitate to remove yourself if you think it's irrelevant.)
Currently the PT/TF/Flax equivalence tests use 1e-5 as the tolerance for the absolute differences of outputs. We see that these tests fail with a non-negligible (although not carefully measured) frequency. This issue tracks a list of models to investigate.
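For context, the core of such an equivalence check can be sketched as follows. This is a minimal illustration with simulated arrays, not the actual test code; `check_equivalence` is a hypothetical helper:

```python
# Sketch of the PT/TF equivalence check described above: compare two
# framework outputs element-wise and fail if the largest absolute
# difference exceeds the tolerance. Illustrative only.
import numpy as np

def check_equivalence(pt_output, tf_output, tol=1e-5):
    """Return (passed, max_abs_diff) for two framework outputs."""
    pt = np.asarray(pt_output, dtype=np.float64)
    tf = np.asarray(tf_output, dtype=np.float64)
    max_diff = float(np.max(np.abs(pt - tf)))
    return max_diff <= tol, max_diff

# Two nearly identical float32 outputs with a small injected drift:
rng = np.random.default_rng(0)
a = rng.standard_normal((2, 8)).astype(np.float32)
b = (a.astype(np.float64) + 2e-5).astype(np.float32)  # ~2e-5 drift

passed, diff = check_equivalence(a, b)
# diff lands around 2e-5, so the 1e-5 tolerance fails - the same shape
# of failure as the reports below.
```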
- FlaxWav2Vec2ModelTest (2.2888184e-05 > 1e-5)
- TFGPT2EncoderDecoderModelTest (0.001009281724691391 > 1e-3)
- https://app.circleci.com/pipelines/github/huggingface/transformers/37358/workflows/43c12161-33d8-4df5-ba3c-3e62a4507ee7/jobs/411579
- This also happens to TFBERTEncoderDecoderModelTest
- This is caused by a sequence in the batch that gets an all-zero attention mask (generated by ids_tensor) - this may happen on both the encoder and the decoder (especially after combining with the causal mask).
- For TFBERTEncoderDecoderModelTest, the difference is smaller than for TFGPT2EncoderDecoderModelTest (by a factor of 5x~10x) -> this is because the last hidden states in GPT2 come after a layer norm (which is not the case for BERT).
- If we look at the cross-attention diff between PT/TF, it is clear that we have the same issue (both are in the magnitude of 1e-3).
- The encoder attention diff between PT/TF is in the magnitude of 5e-8: not very sure why this one doesn't get much larger.
  - This is because PT/TF (at least in BERT) have different encoder_extended_attention_mask values: 1e-4 vs 1e-9.
- TFViTMAEModelTest (1.013279e-05 > 1e-5)
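The all-zero-mask failure mode described for the encoder-decoder tests above can be reproduced in isolation. This is a sketch with toy shapes; the -1e4/-1e9 fill values are illustrative stand-ins for the differing PT/TF extended-attention-mask constants, not the exact values in the library:

```python
# When a sequence's attention mask is all zeros, every attention score
# receives the large negative mask fill before the softmax. In float32,
# subtracting 1e4 keeps the scores (ulp ~1e-3 at that magnitude), while
# subtracting 1e9 rounds them away entirely (ulp = 64), giving a uniform
# softmax - so the two frameworks attend differently on that row.
import numpy as np

def softmax64(x):
    x = np.asarray(x, dtype=np.float64)
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.standard_normal(5).astype(np.float32)  # raw attention scores, O(1)
mask = np.zeros(5, dtype=np.float32)                # all-zero attention mask

# BERT-style additive mask: (1 - mask) * fill, added to scores in float32.
masked_a = scores + (np.float32(1.0) - mask) * np.float32(-1e4)
masked_b = scores + (np.float32(1.0) - mask) * np.float32(-1e9)

probs_a = softmax64(masked_a)  # scores survive float32 rounding at -1e4
probs_b = softmax64(masked_b)  # scores rounded away at -1e9 -> exactly uniform

max_diff = float(np.max(np.abs(probs_a - probs_b)))
# max_diff is on the order of 1e-1 here - far above any 1e-5 tolerance.
```

With a normal (partly unmasked) row, the unmasked scores dominate and both fill values give nearly identical probabilities; the divergence only appears when everything is masked, which matches it showing up only for some randomly generated batches.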
Issue Analytics
- Created a year ago
- Reactions: 1
- Comments: 5 (2 by maintainers)
(just for the record) Among 500 runs, FunnelForMaskedLM.output.logits shows a max diff at around 1e-5 ~ 2e-5 in ~6.8% of runs: so a ~6.8% chance of failure 😢 - and a diff of 8e-6 ~ 9e-6 occurs even more often (so > 25% chance of getting close to 1e-5).

Another one to add to this list: tests/funnel/test_modeling_funnel.py::FunnelModelTest::test_pt_tf_model_equivalence. I've been getting a failure in this one every other day - example: https://app.circleci.com/pipelines/github/huggingface/transformers/38007/workflows/2a98b7b1-5ad0-4b80-a702-1887c620193f/jobs/421265
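The failure-rate bookkeeping behind "among 500 runs" numbers like the above can be sketched like this. The noise model is simulated (Gaussian at float32-rounding scale) and `estimate_flakiness` is an assumed helper, not part of the test suite:

```python
# Repeat a randomized PT/TF-style comparison many times and report what
# fraction of max-abs-diffs lands above (or near) the tolerance. The
# simulated noise stands in for real model-output differences.
import numpy as np

def estimate_flakiness(n_runs=500, tol=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    diffs = []
    for _ in range(n_runs):
        # Stand-in for one run's "max abs diff between PT and TF logits".
        noise = rng.normal(0.0, 2.4e-6, size=(4, 7, 99))
        diffs.append(float(np.max(np.abs(noise))))
    diffs = np.asarray(diffs)
    return {
        "failure_rate": float(np.mean(diffs > tol)),          # diff > 1e-5
        "near_miss_rate": float(np.mean(diffs > 0.8 * tol)),  # diff > 8e-6
        "max_seen": float(diffs.max()),
    }

stats = estimate_flakiness()
```

Looping the real test this way is what turns "it failed again today" into a concrete flakiness estimate, which in turn suggests whether to fix the model or loosen the tolerance.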