Non-deterministic behavior in featurization
Describe the bug
When working with a large (~7k docs) corpus of hardware datasheets and extracting multiple relations, we expect the features for each candidate to be deterministic between runs, even more so with parallelism=1 set in the Featurizer. However, we find that there can be a small number (e.g., < 5) of differences between feature tables, resulting in slightly different sparse matrices and thus slightly different results.
To Reproduce
Running on the HACK transistor dataset will reproduce the error. However, it takes a long time, and we haven’t yet been able to produce a minimal example that reproduces it. Attached are feature table dumps from two different runs with parallelism=1. Note that there is only a single difference, on line 65454.
Note that it isn’t always a single difference, and the differences themselves are not deterministic. The attached diff is just an example.
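To locate differing lines such as the one mentioned above, the two dumps can be compared position by position. The following is only a minimal sketch: the helper names `diff_dumps` and `diff_dump_files` are hypothetical, and it assumes each dump is a plain-text file with one feature row per line, which may not match the actual dump format.

```python
from itertools import zip_longest


def diff_dumps(lines_a, lines_b):
    """Return (line_number, line_a, line_b) for each position where the dumps differ.

    Line numbers are 1-indexed. If one dump is shorter, the missing side is None.
    """
    diffs = []
    for lineno, (a, b) in enumerate(zip_longest(lines_a, lines_b), start=1):
        if a != b:
            diffs.append((lineno, a, b))
    return diffs


def diff_dump_files(path_a, path_b):
    """Compare two dump files line by line (hypothetical helper; paths are up to the caller)."""
    with open(path_a, encoding="utf-8") as fa, open(path_b, encoding="utf-8") as fb:
        # File objects iterate lazily over lines, so large dumps are not loaded whole.
        return diff_dumps(fa, fb)
```

Note that a positional comparison like this only helps if both dumps list rows in the same order; if the row order itself varies between runs, the dumps would need to be sorted on a stable key first.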
Expected behavior
We would expect that these feature tables are identical between runs.
Error Logs/Screenshots
For convenience, here is the differing line in screenshot form
Additional context
If the issue is in the UDF implementation, this might affect the Labeler in addition to the Featurizer, since they share a lot of the UDF code.
Issue Analytics
- State:
- Created 3 years ago
- Comments:8 (8 by maintainers)
The numbers of constituent mentions of PartTemp (Part and Temp) were the same on local Mac and GitHub Actions. This means temp_throttler drops one extra candidate on GitHub Actions.
The difference comes either from same_table, or from is_horz_aligned/is_vert_aligned. I suspect the non-deterministic behaviour comes from the visual_linker.
A few more updates:
dropdb e2e_test and createdb e2e_test). (This is why I could not reproduce it on my local Mac until I deleted and recreated the database.) This would mean that the order of doc.sentences depends on some internal state of the PostgreSQL database.
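PostgreSQL does not guarantee any particular row order unless a query specifies ORDER BY, which is consistent with the observation above. One way to remove that dependence is to sort sentences on an explicit key before iterating over them. The sketch below is illustrative only: the `Sentence` dataclass is a hypothetical stand-in for the real model, and it assumes each sentence carries an integer `position` within its document. It is not necessarily the fix the maintainers applied.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Sentence:
    """Hypothetical stand-in for a sentence row; `position` is an assumed attribute."""
    position: int
    text: str


def sentences_in_order(sentences):
    """Return sentences sorted by position, independent of the order they arrived in."""
    return sorted(sentences, key=lambda s: s.position)


if __name__ == "__main__":
    rows = [Sentence(i, f"sentence {i}") for i in range(5)]
    shuffled = rows[:]
    random.shuffle(shuffled)  # simulate an arbitrary row order coming back from the database
    print([s.position for s in sentences_in_order(shuffled)])
```

Sorting at the point of use keeps downstream steps such as throttling and featurization independent of how the database happens to return rows.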