question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Returning cluster assignments as str conflicts with matplotlib color sequences

See original GitHub issue

Currently, sc.tl.louvain etc return cluster assignments as a Categorical with dtype str resulting in incompatibility with matplotlib color sequences. For example, the following code raises a ValueError:

import numpy as np
import scanpy as sc
import matplotlib.pyplot as plt

adata = sc.AnnData(np.random.normal(size=(100,2)))
sc.pp.neighbors(adata)
sc.tl.louvain(adata)
plt.scatter(adata.X[:,0], adata.X[:,1], c=adata.obs['louvain'])

The error is: ValueError: RGBA values should be within 0-1 range. Funnily enough, this used to work due to a bug in matplotlib that was fixed in https://github.com/matplotlib/matplotlib/pull/13913.

Note, the following code works as intended:

plt.scatter(adata.X[:,0], adata.X[:,1], c=adata.obs['louvain'].astype(int))

I would have submitted a PR changing this behavior had I not noticed that returning cluster assignments as str is explicitly checked here:

https://github.com/theislab/scanpy/blob/78125e6355c0cd2c4ae930495829282eea6f4a52/scanpy/tools/_utils_clustering.py#L11-L23

This brings up a larger design question in scanpy / anndata: Why are arrays of numerics routinely converted to strings representing numbers?

In https://github.com/theislab/anndata/issues/311 I found a case where converting arrays of numerics to strings creates a bug when assigning to AnnData obsm with DataFrames with a RangeIndex. In that case, I understand there’s a desire to avoid ambiguity in positional vs label indexing, but that issue was solved in pandas with the .loc and .iloc conventions. Why not carry that forward?

In this case, why not just return cluster assignments as arrays of numerics as is done in sklearn.cluster?

I think following these conventions will make both tools much more accessible to the general Python data science community.

Issue Analytics

  • State:open
  • Created 4 years ago
  • Comments:19 (13 by maintainers)

github_iconTop GitHub Comments

1reaction
LuckyMDcommented, Apr 3, 2020

I believe there’s a tutorial with this somewhere. Do you know where this was @LuckyMD?

I’m not so familiar with the scanpy tutorials, but I do show sub-clustering in the single-cell-tutorial notebook here

1reaction
dburkhardtcommented, Apr 2, 2020

Hi, I just wanted to bring this back up again because I’ve been logging some of the issue’s I’ve encountered. It seems we’re at a bit of a philosophical divide, so perhaps it’s best for me to just register which use cases I have that AnnData / scanpy are personally causing me friction:

Instead of pasting all errors, I’m just going to paste code blocks I wish worked. Note, these are actual use cases I have regularly encountered.

1. Cannot pass AnnData to numpy or sklearn operators

import scanpy as sc
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import decomposition, cluster

data = np.random.normal(size=(100,10))
adata = sc.AnnData(data)

# All of the following raise errors
np.sqrt(adata)
adata[:, adata.var_names[0:3]] - adata[:, adata.var_names[3:6]]

adata.obsm['X_PCA'] = decomposition.PCA(2).fit_transform(adata)

To answer the question above, I think it should return the whole AnnData object, like how DataFrames return themselves. I don’t know if we think it should “update” the original AnnData. I’m also confused by how this results in a performance decrease? If I do adata = np.sqrt(adata) then isn’t this the same footprint as modifying inplace? If I do adata_sq = np.sqrt(adata) then my intention is to duplicate the adata object. In this case, it is my intention to create a duplicate object, and I would like AnnData to respect this intention. 2. Requirement to use .var_vector or .obs_vector for single columns

# This works as expected
adata[:, adata.var_names[0:3]]

# I wish this did as well.
adata[:, adata.var_names[0]]

3. .var_vector doesn’t return a Series

pdata = pd.DataFrame(data)
# Returns series
pdata[0]

# Returns ndarray
adata.var_vector[0]

4. Clusters as categories creates confusing scatterplots

sc.pp.neighbors(adata)
sc.tl.leiden(adata)

plt.scatter(adata.obs['leiden'], adata.X[:,0])

Produces the following plot. I would like it to have order 0-5 by default

image

5. Cannot pass clusters to c parameter in plt.scatter I would like this to just work. Instead it throws a huge error.

plt.scatter(adata.X[:,0], adata.X[:,1], c=adata.obs['leiden'])

6. Clusters as categories frustrate subclustering I understand this is a niche application, but like 4 and 5, this would be fixed by matching the output of sklearn.cluster operators.

sc.pp.neighbors(adata)
sc.tl.leiden(adata)

cluster_zero = adata[adata.obs['leiden'] == '0']
sub_clusters = cluster.KMeans(n_clusters=2).fit_predict(adata.X)

# Here I'm trying to break up cluster '0' into subclusters with 
# new names that don't clash with the existing clusters
# However, np.max() and the + operators aren't well defined for 
# cateogricals of strings
sub_clusters = sub_clusters + np.max(adata.obs['leiden'])
Read more comments on GitHub >

github_iconTop Results From Across the Web

Returning cluster assignments as str conflicts with matplotlib ...
Currently, sc.tl.louvain etc return cluster assignments as a Categorical with dtype str resulting in incompatibility with matplotlib color ...
Read more >
Chapter 4. Visualization with Matplotlib - O'Reilly
plot is that it can be used to create scatter plots where the properties of each individual point (size, face color, edge color,...
Read more >
colors — Matplotlib 2.0.2 documentation
A module for converting numbers or color arguments to RGB or RGBA. RGB and RGBA are sequences of, respectively, 3 or 4 floats...
Read more >
Top 50 matplotlib Visualizations - The Master Plots (w
A compilation of the Top 50 matplotlib plots most useful in data ... in your data you may want to visualise each group...
Read more >
Biopython Tutorial and Cookbook
a standard sequence class, various clustering modules, a KD tree data ... Since calling str() on a Seq object returns the full sequence...
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found