Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Returning cluster assignments as str conflicts with matplotlib color sequences

See original GitHub issue

Currently, sc.tl.louvain etc return cluster assignments as a Categorical with dtype str resulting in incompatibility with matplotlib color sequences. For example, the following code raises a ValueError:

import numpy as np
import scanpy as sc
import matplotlib.pyplot as plt

adata = sc.AnnData(np.random.normal(size=(100,2)))
sc.pp.neighbors(adata)
sc.tl.louvain(adata)
plt.scatter(adata.X[:,0], adata.X[:,1], c=adata.obs['louvain'])

The error is: ValueError: RGBA values should be within 0-1 range. Funnily enough, this used to work due to a bug in matplotlib that was fixed in https://github.com/matplotlib/matplotlib/pull/13913.

Note, the following code works as intended:

plt.scatter(adata.X[:,0], adata.X[:,1], c=adata.obs['louvain'].astype(int))

I would have submitted a PR changing this behavior had I not noticed that returning cluster assignments as str is explicitly checked here:

https://github.com/theislab/scanpy/blob/78125e6355c0cd2c4ae930495829282eea6f4a52/scanpy/tools/_utils_clustering.py#L11-L23

This brings up a larger design question in scanpy / anndata: Why are arrays of numerics routinely converted to strings representing numbers?

In https://github.com/theislab/anndata/issues/311 I found a case where converting arrays of numerics to strings creates a bug when assigning to AnnData obsm with DataFrames with a RangeIndex. In that case, I understand there’s a desire to avoid ambiguity in positional vs label indexing, but that issue was solved in pandas with the .loc and .iloc conventions. Why not carry that forward?

In this case, why not just return cluster assignments as arrays of numerics as is done in sklearn.cluster?

I think following these conventions will make both tools much more accessible to the general Python data science community.

Issue Analytics

State:
Created 4 years ago
Comments:19 (13 by maintainers)

Top GitHub Comments

1reaction

LuckyMDcommented, Apr 3, 2020

I believe there’s a tutorial with this somewhere. Do you know where this was @LuckyMD?

I’m not so familiar with the scanpy tutorials, but I do show sub-clustering in the single-cell-tutorial notebook here

1reaction

dburkhardtcommented, Apr 2, 2020

Hi, I just wanted to bring this back up again because I’ve been logging some of the issue’s I’ve encountered. It seems we’re at a bit of a philosophical divide, so perhaps it’s best for me to just register which use cases I have that AnnData / scanpy are personally causing me friction:

Instead of pasting all errors, I’m just going to paste code blocks I wish worked. Note, these are actual use cases I have regularly encountered.

1. Cannot pass AnnData to numpy or sklearn operators

import scanpy as sc
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import decomposition, cluster

data = np.random.normal(size=(100,10))
adata = sc.AnnData(data)

# All of the following raise errors
np.sqrt(adata)
adata[:, adata.var_names[0:3]] - adata[:, adata.var_names[3:6]]

adata.obsm['X_PCA'] = decomposition.PCA(2).fit_transform(adata)

To answer the question above, I think it should return the whole AnnData object, like how DataFrames return themselves. I don’t know if we think it should “update” the original AnnData. I’m also confused by how this results in a performance decrease? If I do adata = np.sqrt(adata) then isn’t this the same footprint as modifying inplace? If I do adata_sq = np.sqrt(adata) then my intention is to duplicate the adata object. In this case, it is my intention to create a duplicate object, and I would like AnnData to respect this intention. 2. Requirement to use .var_vector or .obs_vector for single columns

# This works as expected
adata[:, adata.var_names[0:3]]

# I wish this did as well.
adata[:, adata.var_names[0]]

3. .var_vector doesn’t return a Series

pdata = pd.DataFrame(data)
# Returns series
pdata[0]

# Returns ndarray
adata.var_vector[0]

4. Clusters as categories creates confusing scatterplots

sc.pp.neighbors(adata)
sc.tl.leiden(adata)

plt.scatter(adata.obs['leiden'], adata.X[:,0])

Produces the following plot. I would like it to have order 0-5 by default

5. Cannot pass clusters to c parameter in plt.scatter I would like this to just work. Instead it throws a huge error.

plt.scatter(adata.X[:,0], adata.X[:,1], c=adata.obs['leiden'])

6. Clusters as categories frustrate subclustering I understand this is a niche application, but like 4 and 5, this would be fixed by matching the output of sklearn.cluster operators.

sc.pp.neighbors(adata)
sc.tl.leiden(adata)

cluster_zero = adata[adata.obs['leiden'] == '0']
sub_clusters = cluster.KMeans(n_clusters=2).fit_predict(adata.X)

# Here I'm trying to break up cluster '0' into subclusters with 
# new names that don't clash with the existing clusters
# However, np.max() and the + operators aren't well defined for 
# cateogricals of strings
sub_clusters = sub_clusters + np.max(adata.obs['leiden'])