Benchmark tests are flaky
🐛 Bug
I have seen the benchmark test fail with the following error quite a number of times:
> assert np.mean(diffs) < max_diff, f"Lightning diff {diffs} was worse than vanilla PT (threshold {max_diff})"
> E AssertionError: Lightning diff [1.30557255e+00 3.60853071e-07] was worse than vanilla PT (threshold 0.0002)
> E assert 0.6527864532870287 < 0.0002
> E + where 0.6527864532870287 = <function mean at 0x7fcd121d6320>(array([1.30557255e+00, 3.60853071e-07]))
> E + where <function mean at 0x7fcd121d6320> = np.mean
> tests/benchmarks/test_basic_parity.py:38: AssertionError
which is configured at:
- https://github.com/PyTorchLightning/pytorch-lightning/blob/bc1c8b926c5072f58f42ad4b7413a8ef5c904c85/.azure-pipelines/gpu-tests.yml#L121-L123
- https://github.com/PyTorchLightning/pytorch-lightning/blob/bc1c8b926c5072f58f42ad4b7413a8ef5c904c85/.azure-pipelines/gpu-benchmark.yml#L39
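For context, a minimal sketch of the shape of that parity check (the variable names follow the assertion message above; the helper function itself is illustrative, not the actual test code):

```python
import numpy as np

# Simplified sketch of the check at tests/benchmarks/test_basic_parity.py:38.
# `diffs` holds the measured differences between the Lightning run and the
# vanilla PyTorch run; their mean has to stay below a small threshold.
def assert_parity(diffs, max_diff=0.0002):
    diffs = np.asarray(diffs)
    assert np.mean(diffs) < max_diff, (
        f"Lightning diff {diffs} was worse than vanilla PT (threshold {max_diff})"
    )

# The failing case from the log above: the first entry (~1.3) dominates the mean,
# so the assertion fails even though the second entry is tiny.
assert_parity([1.30557255e+00, 3.60853071e-07])  # raises AssertionError
```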
To Reproduce
Expected behavior
No error raised so we can always keep our CI green 🟢
Environment
Additional context
Top Results From Across the Web
What are Flaky Tests? | TeamCity CI/CD Guide - JetBrains
Flaky tests are tests that return new results, despite there being no changes to code. Find out why flaky tests matter and how...
Read more >Flaky Tests: Getting Rid Of A Living Nightmare In Testing
The Science Of Flaky Tests # A flaky test is one that fails to produce the same result each time the same analysis...
Read more >Flaky tests - GitLab Docs
It's a test that sometimes fails, but if you retry it enough times, it passes, eventually. What are the potential cause for a...
Read more >A Pragmatist's Guide to Flaky Test Management
A test is “flaky” whenever it can produce both “passing” and “failing” results for the same code. Test flakiness is a bit like...
Read more >How to reduce flaky test failures - CircleCI
Flaky tests result mostly from insufficient test data, narrow test environment scope, and complex technology. Some other factors that play a ...
Read more >
Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free
Top Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found

Thank you @carmocca for asking! No, actually. I checked the benchmarks at around the time of the commit, and it turned out that we had never run either the pure PyTorch benchmark or the Lightning benchmark with `benchmark=True`. 00211c1 made only the Lightning benchmark run with `benchmark=True` (as it became turned on by default), which seemingly led to more memory usage only in the Lightning benchmark. I will double-check by running both the PyTorch and Lightning benchmarks with the flag turned on/off and will update here.
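For reference, Lightning's `Trainer(benchmark=...)` flag maps onto `torch.backends.cudnn.benchmark`, so the two benchmarks only measure the same thing when that flag is set the same way on both sides. A rough sketch of what "running both with the flag turned on/off" looks like (the Trainer arguments besides `benchmark` are illustrative):

```python
import torch
from pytorch_lightning import Trainer

cudnn_benchmark = True  # flip to False for the second run of each benchmark

# Pure PyTorch benchmark: cudnn autotuning is controlled via the global flag.
torch.backends.cudnn.benchmark = cudnn_benchmark

# Lightning benchmark: Trainer(benchmark=...) sets the same flag internally.
trainer = Trainer(benchmark=cudnn_benchmark, max_epochs=1)
```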
I ran the benchmark with `tests.helpers.advanced_models.ParityModuleCIFAR` again (but this time with `Trainer(benchmark=False)` explicitly specified) and confirmed that there is no difference in memory usage before and after the commit 00211c1. I'll conclude this issue by adding `Trainer(benchmark=False)` to the existing benchmarks.
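A minimal sketch of that conclusion, assuming the Lightning half of the parity benchmark builds its Trainer roughly like this (everything except the `benchmark` argument is illustrative; the real code lives in tests/benchmarks/test_basic_parity.py):

```python
from pytorch_lightning import Trainer
from tests.helpers.advanced_models import ParityModuleCIFAR

# Hypothetical, simplified version of the Lightning side of the parity benchmark.
model = ParityModuleCIFAR()
trainer = Trainer(
    max_epochs=1,
    benchmark=False,  # pin cudnn benchmarking off so Lightning matches vanilla PyTorch
)
trainer.fit(model)
```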