Inconsistent results from ARPACK PCA implementation using VMs with different numbers of CPUs
I am finding that my analysis is not perfectly reproducible across different computational platforms. I thought I was going crazy, but I have since reproduced this finding using the minimal 3000 PBMC dataset clustering example. Essentially, I run the same code on a virtual machine with 8 CPUs or on one with 16 CPUs and get non-identical PCA results. It doesn't seem to matter whether I use the arpack or the randomized solver, even though the randomized solver gives this warning:
Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose
svd_solver='arpack'. This will likely become the Scanpy default in the future.
I'd like to just attach the Jupyter notebook, but it won't let me do that, so I'm copying the code below. …
# First run on a machine with 8 CPUs
import numpy as np
import pandas as pd
import scanpy as sc
adata = sc.read_10x_mtx(
    './data/filtered_gene_bc_matrices/hg19/',
    var_names='gene_symbols',
    cache=True)
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata = adata.copy()
sc.pp.scale(adata, max_value=10)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
adata = adata[:, adata.var.highly_variable]
sc.tl.pca(adata, svd_solver='arpack', random_state=14)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40, random_state=14)
sc.write('test8.h5ad', adata)
sc.tl.pca(adata, svd_solver='randomized', random_state=14)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40, random_state=14)
sc.write('test8_randomized.h5ad', adata)
# Then run on a machine with 16 CPUs
import numpy as np
import pandas as pd
import scanpy as sc
adata = sc.read_10x_mtx(
    './data/filtered_gene_bc_matrices/hg19/',
    var_names='gene_symbols',
    cache=True)
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata = adata.copy()
sc.pp.scale(adata, max_value=10)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
adata = adata[:, adata.var.highly_variable]
sc.tl.pca(adata, svd_solver='arpack', random_state=14)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40, random_state=14)
sc.write('test16.h5ad', adata)
sc.tl.pca(adata, svd_solver='randomized', random_state=14)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40, random_state=14)
sc.write('test16_randomized.h5ad', adata)
# Running on a machine with 16 CPUs, evaluate the differences between the results first from the arpack solver
adata8 = sc.read('test8.h5ad')
adata16 = sc.read('test16.h5ad')
print((adata8.X != adata16.X).sum())
print((adata8.obsm['X_pca'] != adata16.obsm['X_pca']).sum())
print((adata8.uns['neighbors']['connectivities'] != adata16.uns['neighbors']['connectivities']).sum())
sc.tl.leiden(adata8, random_state=14)
sc.tl.leiden(adata16, random_state=14)
display(adata8.obs['leiden'].value_counts())
display(adata16.obs['leiden'].value_counts())
# Running on a machine with 16 CPUs, evaluate the differences between the results from the randomized solver
adata8 = sc.read('test8_randomized.h5ad')
adata16 = sc.read('test16_randomized.h5ad')
print((adata8.X != adata16.X).sum())
print((adata8.obsm['X_pca'] != adata16.obsm['X_pca']).sum())
print((adata8.uns['neighbors']['connectivities'] != adata16.uns['neighbors']['connectivities']).sum())
sc.tl.leiden(adata8, random_state=14)
sc.tl.leiden(adata16, random_state=14)
display(adata8.obs['leiden'].value_counts())
display(adata16.obs['leiden'].value_counts())
This outputs the following (for each solver, the three printed numbers are the counts of differing entries in X, X_pca, and the neighbor connectivities, followed by the Leiden cluster sizes for the 8-CPU and then the 16-CPU run):
0
134513
37696
0 659
1 605
2 398
3 352
4 342
5 174
6 118
7 41
8 11
Name: leiden, dtype: int64
0 527
1 484
2 398
3 324
4 320
5 301
6 174
7 109
8 52
9 11
Name: leiden, dtype: int64
0
134127
37278
0 646
1 617
2 382
3 362
4 334
5 173
6 129
7 46
8 11
Name: leiden, dtype: int64
0 646
1 631
2 408
3 349
4 334
5 170
6 106
7 45
8 11
Name: leiden, dtype: int64
...
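Since comparing float arrays with != counts even the tiniest rounding differences, a quick follow-up check would be to see whether the PCA coordinates differ only at floating-point scale. This is a sketch using the test8.h5ad/test16.h5ad files written above; the per-component sign alignment is my own addition, since principal components are only determined up to a sign flip:
import numpy as np
import scanpy as sc
# Reload the arpack results written above
adata8 = sc.read('test8.h5ad')
adata16 = sc.read('test16.h5ad')
pca8 = adata8.obsm['X_pca']
pca16 = adata16.obsm['X_pca']
# Align per-component signs, since eigenvectors are only defined up to a sign flip
signs = np.sign(np.sum(pca8 * pca16, axis=0))
signs[signs == 0] = 1
# Largest absolute difference per principal component after sign alignment
print(np.max(np.abs(pca8 - pca16 * signs), axis=0))
# True if the two runs agree to within a loose floating-point tolerance
print(np.allclose(pca8, pca16 * signs, atol=1e-5))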
Versions:
scanpy==1.4.4.post1 anndata==0.7.1 umap==0.3.10 numpy==1.18.1 scipy==1.4.1 pandas==1.0.3 scikit-learn==0.22.2.post1 statsmodels==0.11.1 python-igraph==0.8.0 louvain==0.6.1
IIRC, you can limit the number of CPUs used through BLAS. This works on my machine:
Different BLAS libraries use different environment variables for this, so I'd check to make sure it's actually restricting the number of threads used.
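For reference, a minimal sketch of that idea, assuming threadpoolctl is available to verify the limit; which environment variable is honored depends on the BLAS build numpy was compiled against:
import os
# Set these before importing numpy/scipy/scanpy; only the variable matching the BLAS build takes effect
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'
os.environ['OMP_NUM_THREADS'] = '1'
import numpy as np
from threadpoolctl import threadpool_info
# Check that the loaded BLAS/OpenMP libraries are actually limited to one thread
for pool in threadpool_info():
    print(pool['user_api'], pool['num_threads'])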
Right, I think it makes sense that this would happen in a Docker container based on what I'm seeing. I'll let you know if I find a solution!