Snapshot discovery and reading takes quadratic time
Describe the bug
We’re using syrupy, and it works well. Thank you!
Unfortunately we have nearly 500 snapshots, and our test runs are starting to get quite slow. It seems that syrupy makes testing take quadratic time with respect to the number of snapshots.
To reproduce
Create this file
```python
# test_performance.py
import os

import pytest

SIZE = int(os.environ.get("SIZE", 1000))


@pytest.mark.parametrize("x", range(SIZE))
def test_performance(x, snapshot):
    assert x == snapshot
    # assert x == x
```
Run, for instance:
```sh
for s in 100 500 1000 2000; do
    echo "size = $s"
    # create the snapshots
    SIZE=$s pytest test_performance.py --snapshot-update
    # just check them
    SIZE=$s pytest test_performance.py
done
```
The times reported by pytest scale quadratically with the number of tests/snapshots (O(size**2)). I think this is because the number of `read_file`/`_read_snapshot_fossil` calls and `discover_snapshots` calls, as reported by `python -m cProfile -m py.test test_performance.py`, scales linearly (O(size)) with the number of tests/snapshots, and the work required for each call also scales linearly (because the snapshot files contain O(size) data).
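For reference, a rough sketch of how the call counts below can be pulled out of the profiler output with the standard-library `pstats` module (the function names filtered on are just the ones that showed up in my profile and may differ between syrupy versions):

```python
# count_calls.py -- sketch of extracting the call counts from the profile.
# Assumes the profile was saved first with something like:
#   python -m cProfile -o profile.out -m pytest test_performance.py
import pstats

stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative")
# print_stats() accepts regex filters; only matching rows are printed
stats.print_stats("discover_snapshots|read_file|_read_snapshot_fossil")
```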
The times and numbers of calls (this is just for the invocation that checks the snapshots) are something like:
| size | time (seconds) | `discover_snapshots` calls | `read_file` calls |
|---|---|---|---|
| 100 | 0.15 | 200 | 300 |
| 500 | 1.73 | 1000 | 1500 |
| 1000 | 6.24 | 2000 | 3000 |
| 2000 | 21.84 | 4000 | 6000 |
Things of note:
- each doubling from 500 -> 1000 -> 2000 multiplies the time by approximately 4, a classic marker of quadratic performance (see the quick check below)
- the number of calls seems rather large: 2 discovery calls and 3 read-file calls per test/snapshot
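A quick check of those ratios, just arithmetic on the numbers in the table above:

```python
# For quadratic behaviour, doubling the size should roughly quadruple the time.
times = {100: 0.15, 500: 1.73, 1000: 6.24, 2000: 21.84}

for small, big in [(500, 1000), (1000, 2000)]:
    print(f"{small} -> {big}: time ratio = {times[big] / times[small]:.2f} (expect ~4)")
```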
Expected behavior
The test runs should be linear in the number of tests/snapshots. For instance, if the `assert x == x` line is used (and the `snapshot` fixture removed) instead of `assert x == snapshot`, the test run is linear: even `SIZE=10000` finishes in < 4s on my machine.
It seems like this could be handled by discovering the snapshots once (or once per file) and reading each snapshot file once too.
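For illustration, a minimal sketch of the kind of per-file caching I have in mind; this is not syrupy's actual code, and the one-entry-per-line "format" parsed here is made up purely to keep the example self-contained:

```python
from functools import lru_cache


def _parse(text: str) -> dict:
    # Made-up format for illustration: one "name: value" pair per line.
    snapshots = {}
    for line in text.splitlines():
        name, _, value = line.partition(": ")
        snapshots[name] = value
    return snapshots


@lru_cache(maxsize=None)
def read_snapshot_file(path: str) -> dict:
    # Read and parse each snapshot file at most once per test session.
    with open(path) as f:
        return _parse(f.read())


def lookup_snapshot(path: str, name: str):
    # Each assertion becomes an O(1) dict lookup after the first read,
    # so total work per file is O(file size) instead of O(assertions * file size).
    return read_snapshot_file(path).get(name)
```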
Screenshots
Environment (please complete the following information):
- OS: macOS
- Syrupy Version: 1.4
- Python Version: 3.8
Additional context
Top GitHub Comments
Based on your metrics, it seems performance is under control now, or at a minimum it’s no longer quadratic, so I’ll close this issue. If you have other ideas/requests, we’re always open to contributors.
Yeah, unfortunately each file is generally O(number of assertions) in size, because it stores info for each assertion, so when syrupy reads the whole file again it's doing O(number of assertions) work (since it has to at least touch every byte in the file to find the names). That is, O(number of assertions) assertions each doing O(number of assertions) parsing work, leading to O(number of assertions**2) quadratic behaviour for each file.
The simple test in the issue is an extreme example, with up to 2000 assertions in a single file, but it has an impact even for our real-world ~450 snapshots spread across 25 files.
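A toy demonstration of that effect (not syrupy's code, just the re-parse-per-assertion pattern versus parse-once):

```python
import time

def run(n, reparse_every_assertion):
    # `entries` stands in for the contents of a snapshot file with n entries.
    entries = [f"entry {i}" for i in range(n)]
    parsed = None
    start = time.perf_counter()
    for i in range(n):
        if reparse_every_assertion or parsed is None:
            parsed = {e: e for e in entries}  # stands in for parsing the file
        assert parsed[f"entry {i}"] == f"entry {i}"
    return time.perf_counter() - start

# Re-parsing on every assertion scales quadratically; parsing once is linear.
for n in (1000, 2000, 4000):
    print(n, f"re-parse: {run(n, True):.3f}s", f"parse once: {run(n, False):.3f}s")
```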
Here’s the test with `SIZE=10000`:

(syrupy 1.4.3 seems to take approximately 5² = 25 × the time for `SIZE=2000` (8.24s), matching expectations for quadratic behaviour. The no-syrupy version is the one described in the issue, removing the `snapshot` fixture and changing to `assert x == x`.)

Sorry, I’d prefer not to do so for now, but thanks for the invitation! 😄