
leiden and umap not reproducible on different CPUs

See original GitHub issue
  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of scanpy.
  • (optional) I have confirmed this bug exists on the master branch of scanpy.

I noticed that running the same single-cell analyses on different nodes of our HPC produces different results. Starting from the same anndata object with a precomputed X_scVI latent representation, the UMAP embedding and the leiden clustering look different.

On

  • Intel® Xeon® CPU E5-2699A v4 @ 2.40GHz
  • AMD EPYC 7352 24-Core Processor
  • Intel® Xeon® CPU E7-4850 v4 @ 2.10GHz

[image: UMAP plot colored by leiden cluster]

adata.obs["leiden"].value_counts()
0     4268
1     2132
2     1691
3     1662
4     1659
5     1563
...

On

  • Intel® Xeon® CPU E7- 4870 @ 2.40GHz

[image: UMAP plot colored by leiden cluster]

0     3856
1     2168
2     2029
3     1659
4     1636
5     1536
...

Minimal code sample (that we can copy&paste without having any data)

A git repository with example data, notebook and a nextflow pipeline is available here: https://github.com/grst/scanpy_reproducibility

A report of the analysis executed on four different CPU architectures is available here: https://grst.github.io/scanpy_reproducibility/

Versions

WARNING: If you miss a compact list, please try `print_header`!
-----
anndata     0.7.5
scanpy      1.6.0
sinfo       0.3.1
-----
PIL                 8.0.1
anndata             0.7.5
backcall            0.2.0
cairo               1.20.0
cffi                1.14.4
colorama            0.4.4
cycler              0.10.0
cython_runtime      NA
dateutil            2.8.1
decorator           4.4.2
get_version         2.1
h5py                3.1.0
igraph              0.8.3
ipykernel           5.3.4
ipython_genutils    0.2.0
jedi                0.17.2
joblib              0.17.0
kiwisolver          1.3.1
legacy_api_wrap     0.0.0
leidenalg           0.8.3
llvmlite            0.35.0
matplotlib          3.3.3
mpl_toolkits        NA
natsort             7.1.0
numba               0.52.0
numexpr             2.7.1
numpy               1.19.4
packaging           20.7
pandas              1.1.4
parso               0.7.1
pexpect             4.8.0
pickleshare         0.7.5
pkg_resources       NA
prompt_toolkit      3.0.8
ptyprocess          0.6.0
pycparser           2.20
pygments            2.7.2
pyparsing           2.4.7
pytz                2020.4
scanpy              1.6.0
scipy               1.5.3
setuptools_scm      NA
sinfo               0.3.1
six                 1.15.0
sklearn             0.23.2
sphinxcontrib       NA
storemagic          NA
tables              3.6.1
texttable           1.6.3
tornado             6.1
traitlets           5.0.5
umap                0.4.6
wcwidth             0.2.5
yaml                5.3.1
zmq                 20.0.0
-----
IPython             7.19.0
jupyter_client      6.1.7
jupyter_core        4.7.0
-----
Python 3.8.6 | packaged by conda-forge | (default, Nov 27 2020, 19:31:52) [GCC 9.3.0]
Linux-3.10.0-1160.11.1.el7.x86_64-x86_64-with-glibc2.10
64 logical CPU cores, x86_64
-----
Session information updated at 2021-10-15 09:58

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 16 (15 by maintainers)

Top GitHub Comments

1 reaction
grst commented, Oct 19, 2021

That’s a good point, and it is not:

import umap

reducer = umap.UMAP(min_dist=0.5)
embedding = reducer.fit_transform(adata.obsm["X_scVI"])
adata.obsm["X_umap"] = embedding

again produces identical results on only 3 of the 4 CPUs.

Ok, let’s forget about UMAP. It’s only a nice figure to get an overview of the data, and I don’t use it for downstream steps. Irreproducible clustering, on the other hand, is quite a deal-breaker, since for instance cell-type annotations depend on it. I mean, why would I even bother releasing the source code of an analysis alongside the paper if it is not reproducible anyway?

I found out a few more things:

  • the leiden algorithm itself seems deterministic on all 4 nodes when started from a pre-computed adata.obsp["connectivities"].
  • when running pp.neighbors with NUMBA_DISABLE_JIT=1, the clustering is stable on all four nodes (but terribly slow, of course)
  • when rounding the connectivities to 3-4 decimals, the clustering is also stable (and the total runtime drops from 2:30 min to 1:50 min):

adata.obsp["connectivities"] = np.round(adata.obsp["connectivities"], decimals=3)
adata.obsp["distances"] = np.round(adata.obsp["distances"], decimals=3)
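As a side note, np.round applied directly to a scipy sparse matrix may not work on every scipy version; a robust variant is to round only the stored non-zero values via the matrix's .data attribute. The sketch below is a hypothetical helper (round_sparse is not part of scanpy or scipy), shown here on a random toy matrix standing in for adata.obsp["connectivities"]:

```python
import numpy as np
import scipy.sparse as sp

def round_sparse(mat, decimals=3):
    # Round only the stored (non-zero) values of a sparse matrix,
    # leaving the sparsity pattern untouched.
    out = mat.copy()
    out.data = np.round(out.data, decimals=decimals)
    return out

# Toy stand-in for adata.obsp["connectivities"]
conn = sp.random(50, 50, density=0.1, format="csr", random_state=0)
conn_rounded = round_sparse(conn)
```

The same call would apply to adata.obsp["distances"] as well.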
0 reactions
Zethson commented, Oct 20, 2021

In addition to that, @Zethson, what do you think of creating a mlf-core template for single-cell analyses that sets the right defaults?

Mhm, certainly a cool option, but nothing that I could tackle in the next weeks due to time constraints. I would start here with a reproducibility section in the documentation and maybe a “deterministic” switch in the Scanpy settings which sets all required numba flags.
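What such a "deterministic" switch might set is sketched below. NUMBA_DISABLE_JIT, NUMBA_CPU_NAME and NUMBA_CPU_FEATURES are documented numba environment variables; whether they suffice for full cross-CPU reproducibility in scanpy is an assumption here, and they must be set before numba is first imported:

```python
import os

# Numba reads these environment variables at import time, so they must be
# set before scanpy/umap (and hence numba) is first imported.
os.environ["NUMBA_DISABLE_JIT"] = "1"  # fully deterministic, but very slow

# A possibly faster middle ground: keep the JIT but compile for a generic
# CPU target, so the generated code does not depend on the host's
# instruction-set extensions:
# os.environ["NUMBA_CPU_NAME"] = "generic"
# os.environ["NUMBA_CPU_FEATURES"] = ""

# import scanpy as sc  # only after the flags above are in place
```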
