Document reproducibility guarantees
Following up on the discussion here, it would be good to document how to get reproducible results with UMAP.
I think we should consider changing the default of `random_state` in the UMAP constructor to a fixed seed (e.g. 42, like the new `transform_seed` default) so that UMAP is reproducible by default.
We should document that users can set `random_state` to `None` to get faster results at the expense of reproducibility. In this mode no seed can reproduce a given output, because the multithreaded optimization is non-deterministic. (This was introduced in #294.)
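As a rough sketch of the two modes described above, the snippet below builds constructor arguments for `umap.UMAP` and checks whether two fits agree. `random_state` and `transform_seed` are the parameters named in this issue; the helper names `umap_seed_kwargs` and `fit_twice_and_compare` are hypothetical and exist only for illustration. It assumes `umap-learn` and NumPy are installed.

```python
from typing import Optional


def umap_seed_kwargs(reproducible: bool = True, seed: int = 42) -> dict:
    """Build keyword arguments for umap.UMAP (hypothetical helper).

    reproducible=True fixes random_state to `seed`.
    reproducible=False leaves random_state as None, the faster
    multithreaded mode that this issue says cannot be reproduced.
    """
    random_state: Optional[int] = seed if reproducible else None
    return {"random_state": random_state, "transform_seed": seed}


def fit_twice_and_compare(data, **kwargs) -> bool:
    """Fit UMAP twice with identical settings and report exact agreement."""
    # Imported lazily so the helper above can be used without umap-learn.
    import numpy as np
    import umap  # provided by the umap-learn package

    first = umap.UMAP(**kwargs).fit_transform(data)
    second = umap.UMAP(**kwargs).fit_transform(data)
    return bool(np.array_equal(first, second))
```

With a fixed seed, `fit_twice_and_compare(data, **umap_seed_kwargs())` would be expected to agree when run on the same machine with the same library versions. With `umap_seed_kwargs(reproducible=False)` there is no such expectation.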
Issue Analytics
- State:
- Created 4 years ago
- Comments: 15 (5 by maintainers)
Top GitHub Comments
@tomwhite I agree with that statement, but I do believe there was some confusion (I think, in the end, I phrased the question badly). I am planning on putting together a notebook to go in the tutorial documentation that documents this clearly, and gives the justification for the choice made.
I haven’t looked at it closely again to be sure, but my understanding is that `parallel=True` is essentially going to be non-deterministic due to race conditions on updating the embedding. This is, I believe, the unknown behaviour you are thinking of. In practice everything is sparse, and for large datasets the odds of race conditions causing actual issues are very low. This is essentially the benefit of stochastic gradient descent (SGD) over standard gradient descent (GD).
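To make the sparsity argument concrete, here is a toy model, not UMAP's actual optimizer. It applies pairwise "attraction" updates to 1-D points in two different orders, which stands in for the order in which threads happen to write. The function `apply_updates` and the data are invented for illustration, and the model only captures ordering effects, not true data races such as lost writes.

```python
from typing import List, Sequence, Tuple

Edge = Tuple[int, int]


def apply_updates(points: Sequence[float], edges: Sequence[Edge],
                  lr: float = 0.5) -> List[float]:
    """Pull the two endpoints of each edge toward each other, in order."""
    x = list(points)
    for i, j in edges:
        step = lr * (x[j] - x[i])
        x[i] += step
        x[j] -= step
    return x


start = [0.0, 1.0, 3.0, 7.0]

# Both edges touch point 1, so the result depends on which runs first.
shared = [(0, 1), (1, 2)]
shared_forward = apply_updates(start, shared)         # [0.5, 1.75, 1.75, 7.0]
shared_reversed = apply_updates(start, shared[::-1])  # [1.0, 1.0, 2.0, 7.0]

# The edges touch different points, so the order is irrelevant.
disjoint = [(0, 1), (2, 3)]
disjoint_forward = apply_updates(start, disjoint)         # [0.5, 0.5, 5.0, 5.0]
disjoint_reversed = apply_updates(start, disjoint[::-1])  # [0.5, 0.5, 5.0, 5.0]
```

When updates rarely touch the same point, as in a large sparse graph, most of them behave like the disjoint case. That is consistent with the comment's claim that conflicts are unlikely to matter much in practice, although the output is still not bit-for-bit reproducible.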