Non-deterministic behavior in featurization


Describe the bug When working with a large corpus (~7k docs) of hardware datasheets and extracting multiple relations, we expect the features for each candidate to be deterministic across runs, especially with parallelism=1 set in the Featurizer. However, we find a small number of differences (e.g., fewer than 5) between feature tables, which results in slightly different sparse matrices and thus slightly different results.

To Reproduce Running on the HACK transistor dataset will reproduce the error. However, it takes a long time, and we haven't yet found a minimal example that reproduces it. Attached are feature table dumps from two different runs with parallelism=1. Note that there is only a single difference, on line 65454.

feature_table.tar.gz

Note that it isn't always a single difference, and which lines differ is not deterministic. The attached difference is just an example.
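For anyone who wants to locate such differences in their own dumps, here is a minimal sketch of a line-by-line comparison. It assumes each dump is a plain text file with one feature row per line; the function names and file paths are hypothetical and not part of Fonduer.

```python
from itertools import zip_longest


def diff_lines(lines_a, lines_b):
    """Return (line_number, a, b) for every position where the dumps differ.

    Line numbers are 1-indexed. If one dump is shorter, the missing
    side is reported as None.
    """
    diffs = []
    for lineno, (a, b) in enumerate(zip_longest(lines_a, lines_b), start=1):
        if a != b:
            diffs.append((lineno, a, b))
    return diffs


def diff_files(path_a, path_b):
    """Compare two dump files (hypothetical paths) line by line."""
    with open(path_a) as fa, open(path_b) as fb:
        return diff_lines(fa.read().splitlines(), fb.read().splitlines())
```

Note that this comparison is order-sensitive: if the two runs emit the same rows in a different order, every shifted line is reported as a difference.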

Expected behavior We would expect that these feature tables are identical between runs.

Error Logs/Screenshots For convenience, here is the differing line in screenshot form: [image]

Additional context If the issue is in the UDF implementation, this might affect the Labeler in addition to the Featurizer, since they share a lot of the UDF code.

Issue Analytics

  • State:closed
  • Created 3 years ago
  • Comments:8 (8 by maintainers)

Top GitHub Comments

HiromuHota commented, Jun 11, 2020

The number of constituent mentions of PartTemp (Part and Temp) was the same on a local Mac and on GitHub Actions. This means temp_throttler drops one extra candidate on GitHub Actions.

def temp_throttler(c):
    (part, attr) = c
    if same_table((part, attr)):
        return is_horz_aligned((part, attr)) or is_vert_aligned((part, attr))
    return True

The difference comes either from same_table, or from is_horz_aligned/is_vert_aligned. I suspect the non-deterministic behaviour comes from the visual_linker.
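One way to narrow this down is to record the intermediate result of each predicate per candidate and compare those records between runs. The sketch below is only an illustration: explain_throttler is a hypothetical helper, and the three predicates are passed in as arguments so the snippet runs without Fonduer installed.

```python
def explain_throttler(c, same_table, is_horz_aligned, is_vert_aligned):
    """Return the intermediate values that temp_throttler combines.

    The predicates are passed in so they can be the real helpers or
    simple stubs. "horz" and "vert" stay None when the candidate is
    not in a single table, mirroring the early return in temp_throttler.
    """
    in_same_table = same_table(c)
    result = {"same_table": in_same_table, "horz": None, "vert": None}
    if in_same_table:
        result["horz"] = is_horz_aligned(c)
        result["vert"] = is_vert_aligned(c)
        result["keep"] = result["horz"] or result["vert"]
    else:
        result["keep"] = True
    return result
```

Dumping these records for every candidate in two runs and diffing them would show whether the disagreement comes from same_table or from the alignment checks.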

HiromuHota commented, Jun 16, 2020

A few more updates:

  • To reproduce this, I had to delete and re-create the database (dropdb e2e_test and createdb e2e_test). This is why I could not reproduce it on my local Mac until I did so. It suggests that the order of doc.sentences depends on some internal state of the PostgreSQL database.
  • Another observation that supports this: I have not seen the non-deterministic behaviour when no database is used (i.e., when just using the UDFs).
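This is consistent with how SQL works in general: a query without an explicit ORDER BY may return rows in any order. A minimal sketch of the usual remedy, sorting by an explicit key before processing, is shown below. The Sentence class here is a simplified stand-in with an assumed position field, not Fonduer's model, and the thread does not say whether this is the fix that was adopted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sentence:
    """Simplified stand-in for a stored sentence (assumed fields)."""
    position: int  # index of the sentence within its document
    text: str


def ordered_sentences(sentences):
    """Return sentences sorted by position, regardless of input order."""
    return sorted(sentences, key=lambda s: s.position)


# Two "runs" that receive the same rows in different orders.
run_a = [Sentence(0, "Part"), Sentence(1, "Temp"), Sentence(2, "Max")]
run_b = [Sentence(2, "Max"), Sentence(0, "Part"), Sentence(1, "Temp")]

# After sorting on an explicit key, both runs see the same sequence.
same_order = ordered_sentences(run_a) == ordered_sentences(run_b)
```

Any downstream step that depends on sentence order (such as visual linking) would then see the same input across runs.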