
PCA fails with batch highly-variable gene correction

See original GitHub issue

With the new batch_key option in highly_variable_genes, downstream functions like PCA can fail silently with the old defaults. The same holds for sc.pl.highly_variable_genes(adata), which currently doesn't recognize that the output key in adata.var is highly_variable_intersection rather than highly_variable.

sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=10, min_disp=0.1, batch_key="source")
adata_hvg = adata[:, adata.var.highly_variable_intersection].copy()
sc.tl.pca(adata_hvg, svd_solver='arpack', n_comps = 30, use_highly_variable=True) # both the default None and True will error; see below
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-125-322839e541fd> in <module>
----> 1 sc.tl.pca(adata_hvg, svd_solver='arpack', n_comps = 30, use_highly_variable=True)

~/anaconda2/envs/scanpy/lib/python3.6/site-packages/scanpy/preprocessing/_simple.py in pca(data, n_comps, zero_center, svd_solver, random_state, return_info, use_highly_variable, dtype, copy, chunked, chunk_size)
    529             pca_ = TruncatedSVD(n_components=n_comps, random_state=random_state)
    530             X = adata_comp.X
--> 531         X_pca = pca_.fit_transform(X)
    532 
    533     if X_pca.dtype.descr != np.dtype(dtype).descr: X_pca = X_pca.astype(dtype)

~/anaconda2/envs/scanpy/lib/python3.6/site-packages/sklearn/decomposition/pca.py in fit_transform(self, X, y)
    358 
    359         """
--> 360         U, S, V = self._fit(X)
    361         U = U[:, :self.n_components_]
    362 

~/anaconda2/envs/scanpy/lib/python3.6/site-packages/sklearn/decomposition/pca.py in _fit(self, X)
    380 
    381         X = check_array(X, dtype=[np.float64, np.float32], ensure_2d=True,
--> 382                         copy=self.copy)
    383 
    384         # Handle n_components==None

~/anaconda2/envs/scanpy/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    556                              " a minimum of %d is required%s."
    557                              % (n_features, array.shape, ensure_min_features,
--> 558                                 context))
    559 
    560     if warn_on_dtype and dtype_orig is not None and array.dtype != dtype_orig:

ValueError: Found array with 0 feature(s) (shape=(44495, 0)) while a minimum of 1 is required.
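The shape=(44495, 0) comes from subsetting twice with two different boolean columns: once by highly_variable_intersection by the user, and once more by highly_variable inside pca(). A minimal sketch with plain numpy/pandas (illustrative column values, no scanpy required) reproduces the mechanism:

```python
import numpy as np
import pandas as pd

# `var` mimics adata.var after the batch-aware highly_variable_genes call;
# the flag values here are made up purely to illustrate the failure mode.
rng = np.random.default_rng(0)
X = rng.random((5, 4))
var = pd.DataFrame({
    "highly_variable": [False, False, True, True],
    "highly_variable_intersection": [True, True, False, False],
})

# Step 1: the user subsets to the intersection genes.
keep = var["highly_variable_intersection"].to_numpy()
X_hvg = X[:, keep]
var_hvg = var[keep]

# Step 2: pca() sees 'highly_variable' in var_hvg and subsets again --
# none of the remaining genes carry that flag, leaving 0 features.
X_pca_input = X_hvg[:, var_hvg["highly_variable"].to_numpy()]
print(X_pca_input.shape)  # (5, 0) -- the shape sklearn's check_array rejects
```

Whether the second subset empties the matrix depends on how much the two flag columns disagree, which is why the failure can be silent or fatal depending on the data.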

The pca code doesn't raise its own error here, because the batch-aware run still writes a highly_variable column, so 'highly_variable' in adata.var.keys() evaluates to True:

    if use_highly_variable is True and 'highly_variable' not in adata.var.keys():
        raise ValueError('Did not find adata.var[\'highly_variable\']. '
                         'Either your data already only consists of highly-variable genes '
                         'or consider running `pp.highly_variable_genes` first.')
    if use_highly_variable is None:
        use_highly_variable = True if 'highly_variable' in adata.var.keys() else False
    if use_highly_variable:
        logg.info('    on highly variable genes')
    adata_comp = adata[:, adata.var['highly_variable']] if use_highly_variable else adata
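A more defensive version of that subsetting step (a sketch, not scanpy's actual code; subset_to_hvg is a hypothetical helper) would reject an empty gene mask before handing the matrix to sklearn, turning the silent failure into a clear error:

```python
import numpy as np

def subset_to_hvg(X, highly_variable_mask):
    """Subset columns of X to the flagged genes, failing loudly if none remain."""
    mask = np.asarray(highly_variable_mask, dtype=bool)
    if mask.sum() == 0:
        # Without this check, sklearn later fails with the opaque
        # "Found array with 0 feature(s)" ValueError seen above.
        raise ValueError(
            "No genes flagged highly variable; nothing to run PCA on."
        )
    return X[:, mask]
```

This is only one way to surface the problem; the actual fix in scanpy may differ.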

adata.var.keys()
Index(['mito', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts',
       'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts',
       'n_cells', 'highly_variable', 'means', 'dispersions',
       'dispersions_norm', 'highly_variable_nbatches',
       'highly_variable_intersection'],
      dtype='object')

Versions:

scanpy==1.4.5.post2, anndata==0.7.1, umap==0.3.10, numpy==1.17.0, scipy==1.3.0, pandas==0.24.2, scikit-learn==0.21.3, statsmodels==0.11.0dev0+630.g4565348, python-igraph==0.7.1, louvain==0.6.1

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 11 (5 by maintainers)

Top GitHub Comments

1 reaction
LuckyMD commented, Feb 7, 2020

Maybe a solution would be to set highly_variable equal to highly_variable_intersection when using the batch_key. I think highly_variable is a remnant of using highly_variable_genes_single_batch() (or whatever the function is called) to get the individual per-batch HVGs for the intersection calculation. @gokceneraslan will be able to correct me here though.
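As an interim workaround along the lines LuckyMD suggests, the intersection flag can be copied over highly_variable before any downstream call subsets by it. A hedged sketch, with var standing in for adata.var (column names follow scanpy's output, values are illustrative):

```python
import pandas as pd

# Stand-in for adata.var after a batch-aware highly_variable_genes call.
var = pd.DataFrame({
    "highly_variable": [False, True, True],
    "highly_variable_intersection": [True, True, False],
})

# Overwrite the column that pca() actually consults, so the second
# subsetting step selects the same genes as the intersection.
var["highly_variable"] = var["highly_variable_intersection"]
print(var["highly_variable"].tolist())  # [True, True, False]
```

In practice this would be var = adata.var; the one-line assignment is the whole workaround.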

0 reactions
gokceneraslan commented, May 19, 2020

Fixed in #1180.
