PCA fails with batch highly-variable gene correction
With the new `batch_key` option in `highly_variable_genes`, downstream functions like PCA can fail silently with the old defaults. The same is true for `sc.pl.highly_variable_genes(adata)`, which currently doesn't recognize that the output key in `adata.var` is `highly_variable_intersection` rather than `highly_variable`.
```python
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=10, min_disp=0.1, batch_key="source")
adata_hvg = adata[:, adata.var.highly_variable_intersection].copy()
sc.tl.pca(adata_hvg, svd_solver='arpack', n_comps=30, use_highly_variable=True)  # both the default None and True will error; see below
```
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-125-322839e541fd> in <module>
----> 1 sc.tl.pca(adata_hvg, svd_solver='arpack', n_comps = 30, use_highly_variable=True)

~/anaconda2/envs/scanpy/lib/python3.6/site-packages/scanpy/preprocessing/_simple.py in pca(data, n_comps, zero_center, svd_solver, random_state, return_info, use_highly_variable, dtype, copy, chunked, chunk_size)
    529         pca_ = TruncatedSVD(n_components=n_comps, random_state=random_state)
    530         X = adata_comp.X
--> 531         X_pca = pca_.fit_transform(X)
    532
    533     if X_pca.dtype.descr != np.dtype(dtype).descr: X_pca = X_pca.astype(dtype)

~/anaconda2/envs/scanpy/lib/python3.6/site-packages/sklearn/decomposition/pca.py in fit_transform(self, X, y)
    358
    359         """
--> 360         U, S, V = self._fit(X)
    361         U = U[:, :self.n_components_]
    362

~/anaconda2/envs/scanpy/lib/python3.6/site-packages/sklearn/decomposition/pca.py in _fit(self, X)
    380
    381         X = check_array(X, dtype=[np.float64, np.float32], ensure_2d=True,
--> 382                         copy=self.copy)
    383
    384         # Handle n_components==None

~/anaconda2/envs/scanpy/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    556                              " a minimum of %d is required%s."
    557                              % (n_features, array.shape, ensure_min_features,
--> 558                                 context))
    559
    560     if warn_on_dtype and dtype_orig is not None and array.dtype != dtype_orig:

ValueError: Found array with 0 feature(s) (shape=(44495, 0)) while a minimum of 1 is required.
```
The `pca` code doesn't error at its own check here, because the batch-aware run still writes a `highly_variable` column alongside `highly_variable_intersection`, so `'highly_variable' in adata.var.keys()` evaluates to `True`:
```python
if use_highly_variable is True and 'highly_variable' not in adata.var.keys():
    raise ValueError('Did not find adata.var[\'highly_variable\']. '
                     'Either your data already only consists of highly-variable genes '
                     'or consider running `pp.highly_variable_genes` first.')
if use_highly_variable is None:
    use_highly_variable = True if 'highly_variable' in adata.var.keys() else False
if use_highly_variable:
    logg.info('    on highly variable genes')
adata_comp = adata[:, adata.var['highly_variable']] if use_highly_variable else adata
```
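That last line is where the empty selection happens: in the run above, none of the genes retained by the intersection subset carry `highly_variable == True`, so the boolean mask keeps zero columns. A quick check confirms this (a sketch using the objects from the reproduction above, not part of the original report):

```python
# Diagnostic sketch: count the genes sc.tl.pca would keep after the
# intersection subset. In the setup above this prints 0, which matches the
# shape=(44495, 0) ValueError in the traceback.
print(adata_hvg.var['highly_variable'].sum())
```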
```python
>>> adata.var.keys()
Index(['mito', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts',
       'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts',
       'n_cells', 'highly_variable', 'means', 'dispersions',
       'dispersions_norm', 'highly_variable_nbatches',
       'highly_variable_intersection'],
      dtype='object')
```
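Until this is fixed, a minimal workaround sketch (my suggestion, not from the original report) is to skip PCA's internal HVG filter entirely, since `adata_hvg` already contains only the intersection genes:

```python
# Workaround sketch: adata_hvg was already subset to the batch-aware
# intersection, so the second HVG filter inside sc.tl.pca is redundant.
# use_highly_variable=False skips it and avoids the zero-feature error.
sc.tl.pca(adata_hvg, svd_solver='arpack', n_comps=30, use_highly_variable=False)
```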
Versions:

```
scanpy==1.4.5.post2 anndata==0.7.1 umap==0.3.10 numpy==1.17.0 scipy==1.3.0 pandas==0.24.2 scikit-learn==0.21.3 statsmodels==0.11.0dev0+630.g4565348 python-igraph==0.7.1 louvain==0.6.1
```
Maybe a solution would be to set `highly_variable` equal to `highly_variable_intersection` when using the `batch_key`. I think `highly_variable` is a remnant of using `highly_variable_genes_single_batch()` (or whatever the function is called) to get the individual per-batch HVGs for the intersection calculation. @gokceneraslan will be able to correct me here though.

Fixed in #1180.
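In user code, that suggestion amounts to something like the following sketch (assuming the `adata.var` columns listed above; the actual upstream fix landed in #1180 and may differ):

```python
# Sketch of the proposed remapping: treat the batch-aware intersection as the
# definitive HVG set, so downstream consumers of adata.var['highly_variable']
# (sc.tl.pca, sc.pl.highly_variable_genes, ...) pick up the intended genes.
adata.var['highly_variable'] = adata.var['highly_variable_intersection']
sc.tl.pca(adata, svd_solver='arpack', n_comps=30, use_highly_variable=True)
```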