leiden and umap not reproducible on different CPUs
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of scanpy.
- (optional) I have confirmed this bug exists on the master branch of scanpy.
I noticed that running the same single-cell analyses on different nodes of our HPC produces different results.
Starting from the same anndata object with a precomputed `X_scVI` latent representation, the UMAP and the leiden clustering look different.
On
- Intel® Xeon® CPU E5-2699A v4 @ 2.40GHz
- AMD EPYC 7352 24-Core Processor
- Intel® Xeon® CPU E7-4850 v4 @ 2.10GHz
adata.obs["leiden"].value_counts()
0 4268
1 2132
2 1691
3 1662
4 1659
5 1563
...
On
- Intel® Xeon® CPU E7- 4870 @ 2.40GHz
0 3856
1 2168
2 2029
3 1659
4 1636
5 1536
...
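To put a number on how far two such clusterings diverge, rather than comparing value counts by eye, one option is the adjusted Rand index between the label vectors obtained on two nodes. This is only a sketch: the original report does not use this metric, the function name `adjusted_rand_index` is made up for illustration, and the labels below are toy values, not data from the report.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same cells.

    1.0 means identical partitions (up to renaming of cluster labels);
    values near 0 mean agreement no better than chance.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same cells")
    n = len(labels_a)
    # Number of cell pairs sharing a cluster in both labelings, in A, and in B.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0
    expected = together_a * together_b / total_pairs
    max_index = (together_a + together_b) / 2
    if max_index == expected:
        # Degenerate case, e.g. both labelings put every cell in one cluster.
        return 1.0
    return (together_both - expected) / (max_index - expected)


# Toy labels, NOT taken from the report: the same six cells clustered on three nodes.
node_a = ["0", "0", "0", "1", "1", "2"]
node_b = ["1", "1", "1", "0", "0", "2"]  # same partition, clusters renumbered
node_c = ["0", "0", "1", "1", "1", "2"]  # one cell assigned to a different cluster

print(adjusted_rand_index(node_a, node_b))
print(adjusted_rand_index(node_a, node_c))
```

With real data, the two inputs would be the `adata.obs["leiden"]` columns from each node, converted to lists in the same cell order. A value below 1 confirms that the partitions themselves differ and not just the cluster numbering.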
Minimal code sample (that we can copy&paste without having any data)
A git repository with example data, a notebook, and a Nextflow pipeline is available here: https://github.com/grst/scanpy_reproducibility
A report of the analysis executed on four different CPU architectures is available here: https://grst.github.io/scanpy_reproducibility/
Versions
WARNING: If you miss a compact list, please try `print_header`!
-----
anndata 0.7.5
scanpy 1.6.0
sinfo 0.3.1
-----
PIL 8.0.1
anndata 0.7.5
backcall 0.2.0
cairo 1.20.0
cffi 1.14.4
colorama 0.4.4
cycler 0.10.0
cython_runtime NA
dateutil 2.8.1
decorator 4.4.2
get_version 2.1
h5py 3.1.0
igraph 0.8.3
ipykernel 5.3.4
ipython_genutils 0.2.0
jedi 0.17.2
joblib 0.17.0
kiwisolver 1.3.1
legacy_api_wrap 0.0.0
leidenalg 0.8.3
llvmlite 0.35.0
matplotlib 3.3.3
mpl_toolkits NA
natsort 7.1.0
numba 0.52.0
numexpr 2.7.1
numpy 1.19.4
packaging 20.7
pandas 1.1.4
parso 0.7.1
pexpect 4.8.0
pickleshare 0.7.5
pkg_resources NA
prompt_toolkit 3.0.8
ptyprocess 0.6.0
pycparser 2.20
pygments 2.7.2
pyparsing 2.4.7
pytz 2020.4
scanpy 1.6.0
scipy 1.5.3
setuptools_scm NA
sinfo 0.3.1
six 1.15.0
sklearn 0.23.2
sphinxcontrib NA
storemagic NA
tables 3.6.1
texttable 1.6.3
tornado 6.1
traitlets 5.0.5
umap 0.4.6
wcwidth 0.2.5
yaml 5.3.1
zmq 20.0.0
-----
IPython 7.19.0
jupyter_client 6.1.7
jupyter_core 4.7.0
-----
Python 3.8.6 | packaged by conda-forge | (default, Nov 27 2020, 19:31:52) [GCC 9.3.0]
Linux-3.10.0-1160.11.1.el7.x86_64-x86_64-with-glibc2.10
64 logical CPU cores, x86_64
-----
Session information updated at 2021-10-15 09:58
That’s a good point, and it is not:
again produces stable results on only 3/4 CPUs.
Ok, let’s forget about UMAP. It’s only a nice figure to get an overview of the data and I don’t use it for downstream stuff. Irreproducible clustering, on the other hand, is quite a deal-breaker, as for instance cell-type annotations depend on it. I mean, why would I even bother releasing the source code of an analysis alongside the paper if it is not reproducible anyway?
I found out a few more things:
- `adata.obsp["connectivities"]`
- `.pp.neighbors`
- with `NUMBA_DISABLE_JIT=1`, the clustering is stable on all four nodes (but terribly slow, ofc)
Mhm, certainly a cool option, but nothing that I could tackle in the next weeks due to time constraints. I would start here with a reproducibility section in the documentation and maybe a “deterministic” switch in the Scanpy settings which sets all required numba flags.
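As a rough idea of what such a "deterministic" switch could do on the user side today, the sketch below sets `NUMBA_DISABLE_JIT=1`, the one setting reported in this thread as giving stable clustering on all four nodes. The helper name `request_deterministic_numba` is hypothetical and not part of Scanpy. The sketch also assumes that numba reads this variable when it is first imported, so the call has to come before `import scanpy`.

```python
import os
import sys
import warnings


def request_deterministic_numba(env=None):
    """Hypothetical helper: ask numba to run without JIT compilation.

    Sets NUMBA_DISABLE_JIT=1 in the given mapping (os.environ by default)
    and returns that mapping.
    """
    if env is None:
        env = os.environ
    if env is os.environ and "numba" in sys.modules:
        # Assumption: numba picks up its environment configuration at import
        # time, so setting the variable afterwards may have no effect.
        warnings.warn("numba is already imported; NUMBA_DISABLE_JIT may be ignored")
    env["NUMBA_DISABLE_JIT"] = "1"
    return env


# Call this at the very top of the analysis script, before importing scanpy.
request_deterministic_numba()
print(os.environ["NUMBA_DISABLE_JIT"])
```

The same effect can be had by exporting the variable in the shell or job script before starting Python. Either way, disabling the JIT trades speed for stability, as noted above.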