Returning cluster assignments as str conflicts with matplotlib color sequences
Currently, sc.tl.louvain etc. return cluster assignments as a Categorical with dtype str, resulting in incompatibility with matplotlib color sequences. For example, the following code raises a ValueError:
import numpy as np
import scanpy as sc
import matplotlib.pyplot as plt
adata = sc.AnnData(np.random.normal(size=(100,2)))
sc.pp.neighbors(adata)
sc.tl.louvain(adata)
plt.scatter(adata.X[:,0], adata.X[:,1], c=adata.obs['louvain'])
The error is: ValueError: RGBA values should be within 0-1 range. Funnily enough, this used to work due to a bug in matplotlib that was fixed in https://github.com/matplotlib/matplotlib/pull/13913.
Note, the following code works as intended:
plt.scatter(adata.X[:,0], adata.X[:,1], c=adata.obs['louvain'].astype(int))
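As a minimal sketch of why the conversion helps, the snippet below uses a plain pandas Categorical as a stand-in for adata.obs['louvain'] (the sample values are made up for illustration). It shows two ways to get an integer array that matplotlib can map through a colormap: parsing the string labels, or taking the category codes.

```python
import pandas as pd

# Stand-in for adata.obs['louvain']: a Categorical whose categories are strings.
labels = pd.Series(pd.Categorical(["0", "1", "2", "1", "0"]), name="louvain")

# Option 1: parse the string labels as integers (same idea as .astype(int) above).
as_int = labels.astype(int).to_numpy()

# Option 2: use the integer code of each category. This also works when the
# category names are not numeric strings.
as_codes = labels.cat.codes.to_numpy()

print(as_int.tolist())    # [0, 1, 2, 1, 0]
print(as_codes.tolist())  # [0, 1, 2, 1, 0]
```

Either array can be passed as c= to plt.scatter. Note that the codes follow the category order, so they only match the parsed integers when the categories happen to be in numeric order.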
I would have submitted a PR changing this behavior had I not noticed that returning cluster assignments as str is explicitly checked here:
This brings up a larger design question in scanpy / anndata: Why are arrays of numerics routinely converted to strings representing numbers?
In https://github.com/theislab/anndata/issues/311 I found a case where converting arrays of numerics to strings creates a bug when assigning to AnnData obsm with DataFrames with a RangeIndex. In that case, I understand there’s a desire to avoid ambiguity in positional vs label indexing, but that issue was solved in pandas with the .loc and .iloc conventions. Why not carry that forward?
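For readers less familiar with the pandas convention referred to here, a small sketch (the DataFrame and its values are invented for illustration) shows how .loc and .iloc keep label-based and positional access apart even when the index holds integers:

```python
import pandas as pd

# Integer index labels that deliberately differ from the positions 0, 1, 2.
df = pd.DataFrame({"x": [10.0, 20.0, 30.0]}, index=[2, 0, 1])

# .loc selects by label: the row whose index label is 0.
by_label = df.loc[0, "x"]

# .iloc selects by position: the first row, first column.
by_position = df.iloc[0, 0]

print(by_label)     # 20.0
print(by_position)  # 10.0
```

Because the accessor itself says which kind of lookup is meant, integer labels do not need to be turned into strings to avoid ambiguity.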
In this case, why not just return cluster assignments as arrays of numerics as is done in sklearn.cluster?
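To illustrate the sklearn.cluster convention mentioned above, here is a short sketch using KMeans on random data (the choice of KMeans, three clusters, and the random input are assumptions for the example, not something taken from the issue). The fitted estimator exposes its assignments as an integer NumPy array:

```python
import numpy as np
from sklearn.cluster import KMeans

# Random 2-D points, analogous to the adata.X used in the example above.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

# labels_ holds one integer cluster id per row of X.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.labels_

print(labels.dtype.kind)  # 'i' for a signed integer dtype
print(labels[:10])
```

An integer array like this can be passed directly as c= to plt.scatter, which maps the values through the current colormap.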
I think following these conventions will make both tools much more accessible to the general Python data science community.
Issue Analytics
- Created 4 years ago
- Comments:19 (13 by maintainers)
I’m not so familiar with the scanpy tutorials, but I do show sub-clustering in the single-cell-tutorial notebook here
Hi, I just wanted to bring this back up again because I’ve been logging some of the issues I’ve encountered. It seems we’re at a bit of a philosophical divide, so perhaps it’s best for me to just register the use cases where AnnData / scanpy are personally causing me friction:
Instead of pasting all errors, I’m just going to paste code blocks I wish worked. Note, these are actual use cases I have regularly encountered.
1. Cannot pass AnnData to numpy or sklearn operators
To answer the question above, I think it should return the whole AnnData object, like how DataFrames return themselves. I don’t know if we think it should “update” the original AnnData. I’m also confused by how this results in a performance decrease. If I do
adata = np.sqrt(adata)
then isn’t this the same footprint as modifying in place? If I do
adata_sq = np.sqrt(adata)
then my intention is to create a duplicate object, and I would like AnnData to respect this intention.
2. Requirement to use .var_vector or .obs_vector for single columns
3. .var_vector doesn’t return a Series
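For item 1, the DataFrame behaviour being used as the model can be sketched as follows (the gene and cell names are hypothetical). Applying a NumPy ufunc to a DataFrame returns a new DataFrame with the same index and columns, and leaves the input untouched:

```python
import numpy as np
import pandas as pd

# Toy expression table; row and column names are made up for illustration.
df = pd.DataFrame(
    {"gene_a": [1.0, 4.0], "gene_b": [9.0, 16.0]},
    index=["cell_1", "cell_2"],
)

# np.sqrt dispatches to pandas and returns a DataFrame, not a bare ndarray.
df_sq = np.sqrt(df)

print(type(df_sq).__name__)           # DataFrame
print(df_sq.loc["cell_2", "gene_b"])  # 4.0
print(df.loc["cell_2", "gene_b"])     # 16.0, the original is unchanged
```

Rebinding the result to the same name (df = np.sqrt(df)) lets the old object be garbage collected, while binding it to a new name keeps both copies, which is the distinction drawn in item 1.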
4. Clusters as categories creates confusing scatterplots
Produces the following plot. I would like it to have order 0-5 by default
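The plot and the code for item 4 are not reproduced here, so the following is only a sketch of one common way string categories end up in an unexpected order, using invented labels. pandas sorts string categories lexicographically by default, and the order can be reset numerically with reorder_categories:

```python
import pandas as pd

# Hypothetical cluster labels stored as strings, as in adata.obs['louvain'].
clusters = pd.Series(pd.Categorical(["10", "2", "0", "1", "2", "10"]))

# Default category order is the sorted order of the strings.
default_order = list(clusters.cat.categories)

# Re-order the categories by their integer value instead.
numeric_order = sorted(clusters.cat.categories, key=int)
clusters = clusters.cat.reorder_categories(numeric_order, ordered=True)

print(default_order)                   # ['0', '1', '10', '2']
print(list(clusters.cat.categories))   # ['0', '1', '2', '10']
```

Plotting code that iterates over the categories then sees them in numeric order, assuming it respects the category order of the column.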
5. Cannot pass clusters to c parameter in plt.scatter
I would like this to just work. Instead it throws a huge error.
6. Clusters as categories frustrate subclustering
I understand this is a niche application, but like 4 and 5, this would be fixed by matching the output of sklearn.cluster operators.
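The code for item 6 is not shown, so the sketch below is only one plausible illustration of the difference, with made-up labels and a made-up sub-clustering result. With integer labels, sub-clusters can be written straight into the array. With a Categorical, the new labels must be registered as categories first (the "0,0" naming is an assumption chosen for the example):

```python
import numpy as np
import pandas as pd

# Integer labels, as sklearn.cluster would return them.
labels = np.array([0, 0, 1, 1, 0, 2])
mask = labels == 0
sub_labels = np.array([0, 1, 0])  # hypothetical result of re-clustering cluster 0

# Keep sub-cluster 0 under the old id and give other sub-clusters fresh ids.
new_labels = labels.copy()
new_labels[mask] = np.where(sub_labels == 0, 0, labels.max() + sub_labels)

# The same update on a string Categorical needs the categories declared first.
cat = pd.Series(pd.Categorical(["0", "0", "1", "1", "0", "2"]))
cat = cat.cat.add_categories(["0,0", "0,1"])
cat[mask] = ["0,0", "0,1", "0,0"]
cat = cat.cat.remove_unused_categories()

print(new_labels.tolist())  # [0, 3, 1, 1, 0, 2]
print(cat.tolist())         # ['0,0', '0,1', '1', '1', '0,0', '2']
```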