
leiden and umap not reproducible on different CPUs

See original GitHub issue
  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of scanpy.
  • (optional) I have confirmed this bug exists on the master branch of scanpy.

I noticed that running the same single-cell analyses on different nodes of our HPC produces different results. Starting from the same anndata object with a precomputed X_scVI latent representation, the UMAP embedding and the leiden clustering look different.

On

  • Intel® Xeon® CPU E5-2699A v4 @ 2.40GHz
  • AMD EPYC 7352 24-Core Processor
  • Intel® Xeon® CPU E7-4850 v4 @ 2.10GHz

[image: UMAP plot colored by leiden cluster]

adata.obs["leiden"].value_counts()
0     4268
1     2132
2     1691
3     1662
4     1659
5     1563
...

On

  • Intel® Xeon® CPU E7- 4870 @ 2.40GHz

[image: UMAP plot colored by leiden cluster]

0     3856
1     2168
2     2029
3     1659
4     1636
5     1536
...

Minimal code sample (that we can copy&paste without having any data)

A git repository with example data, notebook and a nextflow pipeline is available here: https://github.com/grst/scanpy_reproducibility

A report of the analysis executed on four different CPU architectures is available here: https://grst.github.io/scanpy_reproducibility/

Versions

WARNING: If you miss a compact list, please try `print_header`!
-----
anndata     0.7.5
scanpy      1.6.0
sinfo       0.3.1
-----
PIL                 8.0.1
anndata             0.7.5
backcall            0.2.0
cairo               1.20.0
cffi                1.14.4
colorama            0.4.4
cycler              0.10.0
cython_runtime      NA
dateutil            2.8.1
decorator           4.4.2
get_version         2.1
h5py                3.1.0
igraph              0.8.3
ipykernel           5.3.4
ipython_genutils    0.2.0
jedi                0.17.2
joblib              0.17.0
kiwisolver          1.3.1
legacy_api_wrap     0.0.0
leidenalg           0.8.3
llvmlite            0.35.0
matplotlib          3.3.3
mpl_toolkits        NA
natsort             7.1.0
numba               0.52.0
numexpr             2.7.1
numpy               1.19.4
packaging           20.7
pandas              1.1.4
parso               0.7.1
pexpect             4.8.0
pickleshare         0.7.5
pkg_resources       NA
prompt_toolkit      3.0.8
ptyprocess          0.6.0
pycparser           2.20
pygments            2.7.2
pyparsing           2.4.7
pytz                2020.4
scanpy              1.6.0
scipy               1.5.3
setuptools_scm      NA
sinfo               0.3.1
six                 1.15.0
sklearn             0.23.2
sphinxcontrib       NA
storemagic          NA
tables              3.6.1
texttable           1.6.3
tornado             6.1
traitlets           5.0.5
umap                0.4.6
wcwidth             0.2.5
yaml                5.3.1
zmq                 20.0.0
-----
IPython             7.19.0
jupyter_client      6.1.7
jupyter_core        4.7.0
-----
Python 3.8.6 | packaged by conda-forge | (default, Nov 27 2020, 19:31:52) [GCC 9.3.0]
Linux-3.10.0-1160.11.1.el7.x86_64-x86_64-with-glibc2.10
64 logical CPU cores, x86_64
-----
Session information updated at 2021-10-15 09:58

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 16 (15 by maintainers)

Top GitHub Comments

1 reaction
grst commented, Oct 19, 2021

That’s a good point, and it is not:

import umap

reducer = umap.UMAP(min_dist=0.5)
embedding = reducer.fit_transform(adata.obsm["X_scVI"])
adata.obsm["X_umap"] = embedding

again produces identical results on only 3 of the 4 CPUs.

Ok, let’s forget about UMAP. It’s only a nice figure to get an overview of the data, and I don’t use it for downstream steps. Irreproducible clustering, on the other hand, is quite a deal-breaker, since for instance cell-type annotations depend on it. I mean, why would I even bother releasing the source code of an analysis alongside the paper if it is not reproducible anyway?

I found out a few more things:

  • the leiden algorithm itself seems deterministic on all 4 nodes when started from a pre-computed adata.obsp["connectivities"].
  • when running pp.neighbors with NUMBA_DISABLE_JIT=1, the clustering is stable on all four nodes (but terribly slow, of course)
  • when rounding the connectivities to 3-4 decimals, the clustering is also stable (and the total runtime drops from 2:30 min to 1:50 min):

adata.obsp["connectivities"] = np.round(adata.obsp["connectivities"], decimals=3)
adata.obsp["distances"] = np.round(adata.obsp["distances"], decimals=3)
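As a side note, np.round applied directly to a scipy sparse matrix may not work on every scipy version; a robust variant is to round only the stored non-zero values via the matrix's .data attribute. The sketch below is a hypothetical helper (round_sparse is not part of scanpy or scipy), shown here on a random toy matrix standing in for adata.obsp["connectivities"]:

```python
import numpy as np
import scipy.sparse as sp

def round_sparse(mat, decimals=3):
    # Round only the stored (non-zero) values of a sparse matrix,
    # leaving the sparsity pattern untouched.
    out = mat.copy()
    out.data = np.round(out.data, decimals=decimals)
    return out

# Toy stand-in for adata.obsp["connectivities"]
conn = sp.random(50, 50, density=0.1, format="csr", random_state=0)
conn_rounded = round_sparse(conn)
```

The same call would apply to adata.obsp["distances"] as well.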
0 reactions
Zethson commented, Oct 20, 2021

In addition to that, @Zethson, what do you think of creating a mlf-core template for single-cell analyses that sets the right defaults?

Mhm, certainly a cool option, but nothing that I could tackle in the next weeks due to time constraints. I would start here with a reproducibility section in the documentation and maybe a “deterministic” switch in the Scanpy settings which sets all required numba flags.
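What such a "deterministic" switch might set is sketched below. NUMBA_DISABLE_JIT, NUMBA_CPU_NAME and NUMBA_CPU_FEATURES are documented numba environment variables; whether they suffice for full cross-CPU reproducibility in scanpy is an assumption here, and they must be set before numba is first imported:

```python
import os

# Numba reads these environment variables at import time, so they must be
# set before scanpy/umap (and hence numba) is first imported.
os.environ["NUMBA_DISABLE_JIT"] = "1"  # fully deterministic, but very slow

# A possibly faster middle ground: keep the JIT but compile for a generic
# CPU target, so the generated code does not depend on the host's
# instruction-set extensions:
# os.environ["NUMBA_CPU_NAME"] = "generic"
# os.environ["NUMBA_CPU_FEATURES"] = ""

# import scanpy as sc  # only after the flags above are in place
```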
