
Inconsistent results from arpack PCA implementation using VMs with different numbers of CPUs

See original GitHub issue

I am finding that my analysis is not perfectly reproducible across different computational platforms. I thought I was going crazy, but I have since reproduced this finding using the minimal 3000 PBMC dataset clustering example. Essentially, I run the same code either on a virtual machine with 8 CPUs or on one with 16 CPUs and get non-identical PCA results. It doesn't seem to matter whether I use the arpack or the randomized solver, even though using the randomized solver gives the warning:

Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose svd_solver='arpack'. This will likely become the Scanpy default in the future.

I'd like to just attach the Jupyter notebook, but it won't let me do that, so I'm copying the code below. …

# First run on a machine with 8 CPUs
import numpy as np
import pandas as pd
import scanpy as sc
adata = sc.read_10x_mtx(
    './data/filtered_gene_bc_matrices/hg19/', 
    var_names='gene_symbols',
    cache=True) 

sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata = adata.copy()
sc.pp.scale(adata, max_value=10)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
adata = adata[:, adata.var.highly_variable]
sc.tl.pca(adata, svd_solver='arpack', random_state=14)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40, random_state=14)
sc.write('test8.h5ad', adata)
sc.tl.pca(adata, svd_solver='randomized', random_state=14)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40, random_state=14)
sc.write('test8_randomized.h5ad', adata)

# Then run exactly the same code on a machine with 16 CPUs (only the output filenames differ)
import numpy as np
import pandas as pd
import scanpy as sc
adata = sc.read_10x_mtx(
    './data/filtered_gene_bc_matrices/hg19/', 
    var_names='gene_symbols',
    cache=True) 

sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata = adata.copy()
sc.pp.scale(adata, max_value=10)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
adata = adata[:, adata.var.highly_variable]
sc.tl.pca(adata, svd_solver='arpack', random_state=14)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40, random_state=14)
sc.write('test16.h5ad', adata)
sc.tl.pca(adata, svd_solver='randomized', random_state=14)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40, random_state=14)
sc.write('test16_randomized.h5ad', adata)

# Running on the 16-CPU machine, evaluate the differences between the results, first for the arpack solver
adata8 = sc.read('test8.h5ad')
adata16 = sc.read('test16.h5ad')
print((adata8.X != adata16.X).sum())  # entries of the expression matrix that differ exactly
print((adata8.obsm['X_pca'] != adata16.obsm['X_pca']).sum())  # PCA entries that differ exactly
print((adata8.uns['neighbors']['connectivities'] != adata16.uns['neighbors']['connectivities']).sum())  # graph weights that differ exactly
sc.tl.leiden(adata8, random_state=14)
sc.tl.leiden(adata16, random_state=14)
display(adata8.obs['leiden'].value_counts())
display(adata16.obs['leiden'].value_counts())

# Likewise, evaluate the differences between the results from the randomized solver
adata8 = sc.read('test8_randomized.h5ad')
adata16 = sc.read('test16_randomized.h5ad')
print((adata8.X != adata16.X).sum())
print((adata8.obsm['X_pca'] != adata16.obsm['X_pca']).sum())
print((adata8.uns['neighbors']['connectivities'] != adata16.uns['neighbors']['connectivities']).sum())
sc.tl.leiden(adata8, random_state=14)
sc.tl.leiden(adata16, random_state=14)
display(adata8.obs['leiden'].value_counts())
display(adata16.obs['leiden'].value_counts())

This outputs the following, first for the arpack solver and then for the randomized solver. In each block, the three exact-mismatch counts are followed by the Leiden cluster sizes for the 8-CPU and then the 16-CPU run:

0
134513
37696
0    659
1    605
2    398
3    352
4    342
5    174
6    118
7     41
8     11
Name: leiden, dtype: int64
0    527
1    484
2    398
3    324
4    320
5    301
6    174
7    109
8     52
9     11
Name: leiden, dtype: int64

0
134127
37278
0    646
1    617
2    382
3    362
4    334
5    173
6    129
7     46
8     11
Name: leiden, dtype: int64
0    646
1    631
2    408
3    349
4    334
5    170
6    106
7     45
8     11
Name: leiden, dtype: int64

...
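For scale: an exact elementwise comparison (!=) counts even last-bit floating-point mismatches, and SVD determines each principal component only up to sign. A tolerance-based comparison that first aligns per-component signs gives a better sense of how far apart the two runs really are. A minimal sketch (my own addition, assuming the test8.h5ad and test16.h5ad files written above):

import numpy as np
import scanpy as sc

adata8 = sc.read('test8.h5ad')
adata16 = sc.read('test16.h5ad')
pca8 = adata8.obsm['X_pca']
pca16 = adata16.obsm['X_pca']

# Each principal component is defined only up to sign, so align signs
# per component before measuring the numerical gap.
signs = np.sign(np.sum(pca8 * pca16, axis=0))
print(np.abs(pca8 - pca16 * signs).max())           # largest elementwise gap
print(np.allclose(pca8, pca16 * signs, atol=1e-6))  # True only if differences are tiny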

Versions:

scanpy==1.4.4.post1 anndata==0.7.1 umap==0.3.10 numpy==1.18.1 scipy==1.4.1 pandas==1.0.3 scikit-learn==0.22.2.post1 statsmodels==0.11.1 python-igraph==0.8.0 louvain==0.6.1

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 11 (6 by maintainers)

Top GitHub Comments

1 reaction
ivirshup commented, Apr 30, 2020

IIRC, you can limit the number of CPUs used through BLAS. This works on my machine:

export OMP_NUM_THREADS=1

Different BLAS libraries use different environment variables for this, so I'd check to make sure it's actually restricting the number of threads used.
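For reference, a sketch covering the common backends; the OpenBLAS and MKL variable names are the standard ones for those libraries, not taken from this thread, and they must be set before NumPy is imported:

import os
os.environ['OMP_NUM_THREADS'] = '1'       # OpenMP (used by many BLAS builds)
os.environ['OPENBLAS_NUM_THREADS'] = '1'  # OpenBLAS
os.environ['MKL_NUM_THREADS'] = '1'       # Intel MKL

import numpy as np
import scanpy as sc

Alternatively, the threadpoolctl package (also not mentioned in this thread) can cap BLAS/OpenMP threads at runtime around a single call:

from threadpoolctl import threadpool_limits

with threadpool_limits(limits=1):
    sc.tl.pca(adata, svd_solver='arpack', random_state=14)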

1 reaction
dylkot commented, Apr 28, 2020

Right, I think it makes sense that this would happen in a Docker container based on what I'm seeing. I'll let you know if I find a solution!
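One quick way to see which BLAS/LAPACK backend NumPy and SciPy are actually linked against (the threaded kernels there are the usual source of CPU-count-dependent results) is the built-in config printers:

import numpy as np
import scipy

np.show_config()     # prints BLAS/LAPACK build info for NumPy
scipy.show_config()  # same for SciPy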

Read more comments on GitHub
