
PCA fails with batch highly-variable gene correction

See original GitHub issue

With the new batch_key option in highly_variable_genes, downstream functions like PCA can fail silently with the old defaults. The same holds for sc.pl.highly_variable_genes(adata), which currently doesn't recognize that the output key in adata.var is highly_variable_intersection rather than highly_variable.

sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=10, min_disp=0.1, batch_key="source")
adata_hvg = adata[:, adata.var.highly_variable_intersection].copy()
sc.tl.pca(adata_hvg, svd_solver='arpack', n_comps = 30, use_highly_variable=True) # both the default None and True will error; see below
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-125-322839e541fd> in <module>
----> 1 sc.tl.pca(adata_hvg, svd_solver='arpack', n_comps = 30, use_highly_variable=True)

~/anaconda2/envs/scanpy/lib/python3.6/site-packages/scanpy/preprocessing/_simple.py in pca(data, n_comps, zero_center, svd_solver, random_state, return_info, use_highly_variable, dtype, copy, chunked, chunk_size)
    529             pca_ = TruncatedSVD(n_components=n_comps, random_state=random_state)
    530             X = adata_comp.X
--> 531         X_pca = pca_.fit_transform(X)
    532 
    533     if X_pca.dtype.descr != np.dtype(dtype).descr: X_pca = X_pca.astype(dtype)

~/anaconda2/envs/scanpy/lib/python3.6/site-packages/sklearn/decomposition/pca.py in fit_transform(self, X, y)
    358 
    359         """
--> 360         U, S, V = self._fit(X)
    361         U = U[:, :self.n_components_]
    362 

~/anaconda2/envs/scanpy/lib/python3.6/site-packages/sklearn/decomposition/pca.py in _fit(self, X)
    380 
    381         X = check_array(X, dtype=[np.float64, np.float32], ensure_2d=True,
--> 382                         copy=self.copy)
    383 
    384         # Handle n_components==None

~/anaconda2/envs/scanpy/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    556                              " a minimum of %d is required%s."
    557                              % (n_features, array.shape, ensure_min_features,
--> 558                                 context))
    559 
    560     if warn_on_dtype and dtype_orig is not None and array.dtype != dtype_orig:

ValueError: Found array with 0 feature(s) (shape=(44495, 0)) while a minimum of 1 is required.
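The shape=(44495, 0) comes from subsetting twice with two different boolean columns: once by highly_variable_intersection by the user, and once more by highly_variable inside pca(). A minimal sketch with plain numpy/pandas (illustrative column values, no scanpy required) reproduces the mechanism:

```python
import numpy as np
import pandas as pd

# `var` mimics adata.var after the batch-aware highly_variable_genes call;
# the flag values here are made up purely to illustrate the failure mode.
rng = np.random.default_rng(0)
X = rng.random((5, 4))
var = pd.DataFrame({
    "highly_variable": [False, False, True, True],
    "highly_variable_intersection": [True, True, False, False],
})

# Step 1: the user subsets to the intersection genes.
keep = var["highly_variable_intersection"].to_numpy()
X_hvg = X[:, keep]
var_hvg = var[keep]

# Step 2: pca() sees 'highly_variable' in var_hvg and subsets again --
# none of the remaining genes carry that flag, leaving 0 features.
X_pca_input = X_hvg[:, var_hvg["highly_variable"].to_numpy()]
print(X_pca_input.shape)  # (5, 0) -- the shape sklearn's check_array rejects
```

Whether the second subset empties the matrix depends on how much the two flag columns disagree, which is why the failure can be silent or fatal depending on the data.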

The pca code doesn't raise its own error here, because the batch-aware run still writes a highly_variable column, so 'highly_variable' in adata.var.keys() evaluates to True:

    if use_highly_variable is True and 'highly_variable' not in adata.var.keys():
        raise ValueError('Did not find adata.var[\'highly_variable\']. '
                         'Either your data already only consists of highly-variable genes '
                         'or consider running `pp.highly_variable_genes` first.')
    if use_highly_variable is None:
        use_highly_variable = True if 'highly_variable' in adata.var.keys() else False
    if use_highly_variable:
        logg.info('    on highly variable genes')
    adata_comp = adata[:, adata.var['highly_variable']] if use_highly_variable else adata
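A more defensive version of that subsetting step (a sketch, not scanpy's actual code; subset_to_hvg is a hypothetical helper) would reject an empty gene mask before handing the matrix to sklearn, turning the silent failure into a clear error:

```python
import numpy as np

def subset_to_hvg(X, highly_variable_mask):
    """Subset columns of X to the flagged genes, failing loudly if none remain."""
    mask = np.asarray(highly_variable_mask, dtype=bool)
    if mask.sum() == 0:
        # Without this check, sklearn later fails with the opaque
        # "Found array with 0 feature(s)" ValueError seen above.
        raise ValueError(
            "No genes flagged highly variable; nothing to run PCA on."
        )
    return X[:, mask]
```

This is only one way to surface the problem; the actual fix in scanpy may differ.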

adata.var.keys()
Index(['mito', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts',
       'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts',
       'n_cells', 'highly_variable', 'means', 'dispersions',
       'dispersions_norm', 'highly_variable_nbatches',
       'highly_variable_intersection'],
      dtype='object')

Versions:

scanpy==1.4.5.post2, anndata==0.7.1, umap==0.3.10, numpy==1.17.0, scipy==1.3.0, pandas==0.24.2, scikit-learn==0.21.3, statsmodels==0.11.0dev0+630.g4565348, python-igraph==0.7.1, louvain==0.6.1

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 11 (5 by maintainers)

Top GitHub Comments

1 reaction
LuckyMD commented, Feb 7, 2020

Maybe a solution would be to set highly_variable equal to highly_variable_intersection when using the batch_key. I think highly_variable is a remnant of using highly_variable_genes_single_batch() (or whatever the function is called) to get the individual per-batch HVGs for the intersection calculation. @gokceneraslan will be able to correct me here though.
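As an interim workaround along the lines LuckyMD suggests, the intersection flag can be copied over highly_variable before any downstream call subsets by it. A hedged sketch, with var standing in for adata.var (column names follow scanpy's output, values are illustrative):

```python
import pandas as pd

# Stand-in for adata.var after a batch-aware highly_variable_genes call.
var = pd.DataFrame({
    "highly_variable": [False, True, True],
    "highly_variable_intersection": [True, True, False],
})

# Overwrite the column that pca() actually consults, so the second
# subsetting step selects the same genes as the intersection.
var["highly_variable"] = var["highly_variable_intersection"]
print(var["highly_variable"].tolist())  # [True, True, False]
```

In practice this would be var = adata.var; the one-line assignment is the whole workaround.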

0 reactions
gokceneraslan commented, May 19, 2020

Fixed in #1180.
