Non-deterministic behavior in featurization


Describe the bug

When working with a large (~7k docs) corpus of hardware datasheets and extracting multiple relations, we expect the features for each candidate to be deterministic across runs, especially with parallelism=1 set in the Featurizer. However, we find that there can be a small number (e.g., < 5) of differences between feature tables, resulting in slightly different sparse matrices and thus slightly different results.

To Reproduce

Running on the HACK transistor dataset will reproduce the error. However, it takes a long time, and we haven't yet found a minimal example that reproduces it. Attached are feature table dumps from two different runs with parallelism=1. Note that there is only a single difference, on line 65454.

feature_table.tar.gz

Note that there isn't always exactly one difference, and which lines differ is itself not deterministic. The attached difference is just an example.
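The comparison described above (two dumps, one differing line) can be automated with a few lines of Python. This is only a sketch: it assumes each dump is a plain-text file with one feature row per line, and the function names and sample rows are made up for illustration, not taken from the attached archive.

```python
from itertools import zip_longest
from typing import Iterable, List, Optional, Tuple

Mismatch = Tuple[int, Optional[str], Optional[str]]


def diff_lines(lines_a: Iterable[str], lines_b: Iterable[str]) -> List[Mismatch]:
    """Return (1-based line number, line from A, line from B) for each mismatch.

    If one input is shorter, its missing lines are reported as None.
    """
    mismatches = []
    pairs = zip_longest(lines_a, lines_b, fillvalue=None)
    for number, (a, b) in enumerate(pairs, start=1):
        if a != b:
            mismatches.append((number, a, b))
    return mismatches


def diff_files(path_a: str, path_b: str) -> List[Mismatch]:
    """Compare two text dumps line by line (paths are supplied by the caller)."""
    with open(path_a, encoding="utf-8") as fa, open(path_b, encoding="utf-8") as fb:
        return diff_lines(fa, fb)


if __name__ == "__main__":
    # Hypothetical rows standing in for two feature table dumps.
    run_1 = ["cand_1,feat_a,1", "cand_2,feat_b,1", "cand_3,feat_c,1"]
    run_2 = ["cand_1,feat_a,1", "cand_2,feat_x,1", "cand_3,feat_c,1"]
    for number, a, b in diff_lines(run_1, run_2):
        print(f"line {number}: {a!r} != {b!r}")
```

A positional comparison like this only makes sense if both dumps were written in the same row order. If the order can vary, sort both files on a stable key first.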

Expected behavior

We would expect these feature tables to be identical between runs.

Error Logs/Screenshots

For convenience, here is the differing line in screenshot form.

Additional context

If the issue is in the UDF implementation, it might affect the Labeler as well as the Featurizer, since they share a lot of the UDF code.

Issue Analytics

Created 3 years ago
Comments:8 (8 by maintainers)

Top GitHub Comments


HiromuHota commented, Jun 11, 2020

The numbers of constituent mentions of PartTemp (Part and Temp) were the same on my local Mac and on GitHub Actions. This means temp_throttler drops one extra candidate on GitHub Actions.

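from fonduer.utils.data_model_utils import is_horz_aligned, is_vert_aligned, same_table

def temp_throttler(c):
    # Keep a same-table candidate only if its two mentions are aligned.
    (part, attr) = c
    if same_table((part, attr)):
        return is_horz_aligned((part, attr)) or is_vert_aligned((part, attr))
    # Candidates whose mentions do not share a table always pass.
    return True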

The difference comes either from same_table, or from is_horz_aligned/is_vert_aligned. I suspect the non-deterministic behaviour comes from the visual_linker.
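One way to narrow down which of these checks disagrees between environments is to record every predicate's result per candidate in each run and then diff the two records. The sketch below is self-contained and uses stand-in predicates and made-up candidate keys. In practice one would presumably pass same_table, is_horz_aligned and is_vert_aligned and key each candidate by something stable across runs; that is an assumption, not something tested against Fonduer.

```python
from typing import Any, Callable, Dict, Hashable, List, Tuple

Predicate = Callable[[Any], bool]
Trace = Dict[Hashable, Dict[str, bool]]


def trace_predicates(
    candidates: Dict[Hashable, Any], predicates: Dict[str, Predicate]
) -> Trace:
    """Record the result of every predicate for every candidate, keyed by a stable id."""
    return {
        key: {name: bool(pred(cand)) for name, pred in predicates.items()}
        for key, cand in candidates.items()
    }


def diff_traces(trace_a: Trace, trace_b: Trace) -> List[Tuple[Hashable, str, Any, Any]]:
    """List (candidate key, predicate name, value in A, value in B) where two runs disagree."""
    diffs = []
    for key in sorted(set(trace_a) | set(trace_b), key=repr):
        row_a, row_b = trace_a.get(key, {}), trace_b.get(key, {})
        for name in sorted(set(row_a) | set(row_b)):
            if row_a.get(name) != row_b.get(name):
                diffs.append((key, name, row_a.get(name), row_b.get(name)))
    return diffs


if __name__ == "__main__":
    # Stand-in data: each "candidate" is a (part, attr) pair of plain integers.
    candidates = {"doc1:c1": (1, 2), "doc1:c2": (3, 3)}
    local = trace_predicates(candidates, {"aligned": lambda c: c[0] == c[1]})
    ci = trace_predicates(candidates, {"aligned": lambda c: c[0] <= c[1]})
    for key, name, a, b in diff_traces(local, ci):
        print(f"{key}: {name} was {a} locally but {b} on CI")
```

Dumping each trace to a file (for example as JSON) in both environments would make it possible to compare them offline.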


HiromuHota commented, Jun 16, 2020

A few more updates:

To reproduce this, I had to delete and re-create the database (dropdb e2e_test and createdb e2e_test), which is why I could not reproduce it on my local Mac until I did so. This suggests that the order of doc.sentences depends on some internal state of the PostgreSQL database.
Another observation that supports this: I have not seen the non-deterministic behaviour when a database is not used (i.e., when just using UDFs).
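If row order is indeed the cause, the general remedy is to request an explicit order, because SQL does not guarantee any row order for a query without ORDER BY. The sketch below illustrates the idea with Python's built-in sqlite3 module as a stand-in for PostgreSQL. The table and column names (sentence, document_id, position, text) are hypothetical and are not taken from Fonduer's actual schema.

```python
import sqlite3
from typing import List, Tuple


def fetch_sentences(conn: sqlite3.Connection, document_id: int) -> List[Tuple[int, str]]:
    """Return (position, text) rows for one document in an explicitly defined order."""
    cursor = conn.execute(
        "SELECT position, text FROM sentence "
        "WHERE document_id = ? ORDER BY position",
        (document_id,),
    )
    return cursor.fetchall()


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory database whose rows are inserted out of order on purpose."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sentence (document_id INTEGER, position INTEGER, text TEXT)"
    )
    conn.executemany(
        "INSERT INTO sentence VALUES (?, ?, ?)",
        [(1, 2, "third"), (1, 0, "first"), (1, 1, "second"), (2, 0, "other doc")],
    )
    return conn


if __name__ == "__main__":
    for position, text in fetch_sentences(make_demo_db(), document_id=1):
        print(position, text)
```

With the ORDER BY clause the result no longer depends on insertion order or on how the database happens to store the rows. The same principle applies to an ORM relationship: give the collection an explicit ordering rather than relying on whatever order the database returns.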
