Inconsistent results from ARPACK PCA implementation using VMs with different numbers of CPUs
I am finding that my analysis is not perfectly reproducible across different computational platforms. I thought I was going crazy, but I have since reproduced this finding using the minimal 3000 PBMC dataset clustering example. Essentially, I run the same code on a virtual machine with 8 CPUs or on one with 16 CPUs and get non-identical PCA results. It doesn't seem to matter whether I use the arpack or the randomized solver, even though the randomized solver gives this warning:
Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose
svd_solver='arpack'. This will likely become the Scanpy default in the future.
I'd like to just attach the Jupyter notebook, but it won't let me do that, so I'm copying the code below. …
# First run on a machine with 8 CPUs
import numpy as np
import pandas as pd
import scanpy as sc
adata = sc.read_10x_mtx(
    './data/filtered_gene_bc_matrices/hg19/',
    var_names='gene_symbols',
    cache=True)
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata = adata.copy()
sc.pp.scale(adata, max_value=10)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
adata = adata[:, adata.var.highly_variable]
sc.tl.pca(adata, svd_solver='arpack', random_state=14)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40, random_state=14)
sc.write('test8.h5ad', adata)
sc.tl.pca(adata, svd_solver='randomized', random_state=14)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40, random_state=14)
sc.write('test8_randomized.h5ad', adata)
# Then run on a machine with 16 CPUs
import numpy as np
import pandas as pd
import scanpy as sc
adata = sc.read_10x_mtx(
    './data/filtered_gene_bc_matrices/hg19/',
    var_names='gene_symbols',
    cache=True)
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata = adata.copy()
sc.pp.scale(adata, max_value=10)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
adata = adata[:, adata.var.highly_variable]
sc.tl.pca(adata, svd_solver='arpack', random_state=14)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40, random_state=14)
sc.write('test16.h5ad', adata)
sc.tl.pca(adata, svd_solver='randomized', random_state=14)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40, random_state=14)
sc.write('test16_randomized.h5ad', adata)
# Running on a machine with 16 CPUs, evaluate the differences between the results first from the arpack solver
adata8 = sc.read('test8.h5ad')
adata16 = sc.read('test16.h5ad')
print((adata8.X != adata16.X).sum())
print((adata8.obsm['X_pca'] != adata16.obsm['X_pca']).sum())
print((adata8.uns['neighbors']['connectivities'] != adata16.uns['neighbors']['connectivities']).sum())
sc.tl.leiden(adata8, random_state=14)
sc.tl.leiden(adata16, random_state=14)
display(adata8.obs['leiden'].value_counts())
display(adata16.obs['leiden'].value_counts())
# Running on a machine with 16 CPUs, evaluate the differences between the results from the randomized solver
adata8 = sc.read('test8_randomized.h5ad')
adata16 = sc.read('test16_randomized.h5ad')
print((adata8.X != adata16.X).sum())
print((adata8.obsm['X_pca'] != adata16.obsm['X_pca']).sum())
print((adata8.uns['neighbors']['connectivities'] != adata16.uns['neighbors']['connectivities']).sum())
sc.tl.leiden(adata8, random_state=14)
sc.tl.leiden(adata16, random_state=14)
display(adata8.obs['leiden'].value_counts())
display(adata16.obs['leiden'].value_counts())
This outputs the following (for each solver, the three printed numbers are the counts of differing entries in X, X_pca, and the neighbor connectivities, followed by the Leiden cluster sizes for the 8-CPU and then the 16-CPU run):
0
134513
37696
0 659
1 605
2 398
3 352
4 342
5 174
6 118
7 41
8 11
Name: leiden, dtype: int64
0 527
1 484
2 398
3 324
4 320
5 301
6 174
7 109
8 52
9 11
Name: leiden, dtype: int64
0
134127
37278
0 646
1 617
2 382
3 362
4 334
5 173
6 129
7 46
8 11
Name: leiden, dtype: int64
0 646
1 631
2 408
3 349
4 334
5 170
6 106
7 45
8 11
Name: leiden, dtype: int64
...
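Since comparing float arrays with != counts even the tiniest rounding differences, a quick follow-up check would be to see whether the PCA coordinates differ only at floating-point scale. This is a sketch using the test8.h5ad/test16.h5ad files written above; the per-component sign alignment is my own addition, since principal components are only determined up to a sign flip:
import numpy as np
import scanpy as sc
# Reload the arpack results written above
adata8 = sc.read('test8.h5ad')
adata16 = sc.read('test16.h5ad')
pca8 = adata8.obsm['X_pca']
pca16 = adata16.obsm['X_pca']
# Align per-component signs, since eigenvectors are only defined up to a sign flip
signs = np.sign(np.sum(pca8 * pca16, axis=0))
signs[signs == 0] = 1
# Largest absolute difference per principal component after sign alignment
print(np.max(np.abs(pca8 - pca16 * signs), axis=0))
# True if the two runs agree to within a loose floating-point tolerance
print(np.allclose(pca8, pca16 * signs, atol=1e-5))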
Versions:
scanpy==1.4.4.post1 anndata==0.7.1 umap==0.3.10 numpy==1.18.1 scipy==1.4.1 pandas==1.0.3 scikit-learn==0.22.2.post1 statsmodels==0.11.1 python-igraph==0.8.0 louvain==0.6.1
IIRC, you can limit the number of CPUs used through BLAS. This works on my machine:
Different BLAS libraries use different environment variables for this, so I'd check to make sure it's actually restricting the number of threads used.
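For reference, a minimal sketch of that idea, assuming threadpoolctl is available to verify the limit; which environment variable is honored depends on the BLAS build numpy was compiled against:
import os
# Set these before importing numpy/scipy/scanpy; only the variable matching the BLAS build takes effect
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'
os.environ['OMP_NUM_THREADS'] = '1'
import numpy as np
from threadpoolctl import threadpool_info
# Check that the loaded BLAS/OpenMP libraries are actually limited to one thread
for pool in threadpool_info():
    print(pool['user_api'], pool['num_threads'])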
Right, I think it makes sense that this would happen in a Docker container based on what I'm seeing. I'll let you know if I find a solution!