Dotplot where sizes are proportional to p-value and the color to log2-fold change?

@fidelram, as discussed today, could we adapt pl.rank_genes_groups_dotplot so that it reads this information from .uns['rank_genes_groups']?

Maybe just a simple switch? Or having arguments color and size be a choice from a selection {pvals, pvals_adj, log2FC, expression, frac-genes-expressed}.
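
To make the proposal concrete, here is a minimal sketch of how a choice of statistic for size and color could be turned into dot sizes and color values. It is not Scanpy code: the function dot_encoding, its max_size parameter and the numbers in stats are assumptions made up for illustration, and only the key names come from the selection listed above.

```python
import math

# Hypothetical per-gene statistics for one group. The keys mirror the
# choices listed in the proposal; the numbers are made up for illustration.
stats = {
    "pvals": [1e-12, 3e-4, 0.2],
    "pvals_adj": [1e-10, 1e-2, 0.5],
    "log2FC": [2.5, -1.0, 0.1],
}

ALLOWED = {"pvals", "pvals_adj", "log2FC", "expression", "frac-genes-expressed"}


def dot_encoding(stats, size="pvals_adj", color="log2FC", max_size=200.0):
    """Return (sizes, colors) for a dot plot from the chosen statistics."""
    for key in (size, color):
        if key not in ALLOWED:
            raise ValueError(f"{key!r} is not one of {sorted(ALLOWED)}")
        if key not in stats:
            raise KeyError(f"{key!r} was not computed")

    def transform(key):
        values = stats[key]
        if key.startswith("pvals"):
            # Smaller p-values should give bigger dots, so use -log10(p).
            return [-math.log10(max(v, 1e-300)) for v in values]
        return list(values)

    raw_sizes = transform(size)
    largest = max(abs(v) for v in raw_sizes) or 1.0
    # Scale so that the most extreme value gets max_size.
    sizes = [max_size * abs(v) / largest for v in raw_sizes]
    colors = transform(color)
    return sizes, colors


sizes, colors = dot_encoding(stats, size="pvals_adj", color="log2FC")
```

The two lists could then be handed to a plotting call such as matplotlib's scatter(x, y, s=sizes, c=colors). Whether p-values should be scaled as -log10(p) or in some other way is a design choice this sketch simply assumes.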

falexwolf commented, Apr 28, 2019

Sounds great!

Re tidy: Storing things internally in tidy format also seems inefficient to me… I remember a long discussion with Philipp more than 2 years ago… 😄

Re diffxpy: If you say that diffxpy has a good solution, why should we build a new one? Can’t we just use their solution?

I think there are also two separate problems here, which are “what’s a better way to store differential expression results” and “what’s a good API for differential expression”.

Completely agreed.

I’m interested in the sc.ex module you’re suggesting. Would you mind elaborating a bit more on that, particularly on some functions that would be there?

Re sc.extract

One of the core ideas of Scanpy (as opposed to, say, scikit-learn) was to have this model of taking the burden of bookkeeping from the user as much as possible. This design messed up, in particular, the return values of rank_genes_groups. I would have loved to return a collection of dataframes, but I didn’t want to mess this up. Also, the return values of pp.neighbors or pl.paga aren’t great.

There is a trade-off between having nice APIs and return values (such as dataframes) and a transparent and efficient on-disk representation in terms of HDF5, zarr or another format. These days, I’d even consider simply pickling things, which would have saved us a lot of work. But I thought that we’d need established compression facilities, concatenation possibilities, and some way to manually “look into” an on-disk object (both from R and from the command line) so that it’s maximally transparent. The widely established, cross-language, but old-school and not entirely scalable HDF5 then seemed the best choice. Meanwhile, the Human Cell Atlas has decided in favor of zarr. But that’s not a drama, because Scanpy only writes “storage-friendly” values to AnnData, that is, arrays and dicts. HDF5 knows how to handle them, and so does zarr. If one uses xarray or dataframes, one has to think about how this gets written to disk.

That being said: it’s likely that we’ll continue to choose representations for on-disk (and in-memory) storage that aren’t convenient (rec arrays, for instance), a three-dimensional xarray and dicts.

A general solution for this problem would be the mentioned sc.extract API, similar to sc.plotting (which also completely hides the complexity of the object from the user), except that it would return nice objects rather than visualizations.

The first function in that namespace should be sc.ex.neighbors, which should return an instance of sc.Neighbors (which can then disappear from the root API). Similarly, when sc.pp.neighbors is called with inplace=False, one should directly get a Neighbors object returned.

Now, we can apply this logic to every single function that doesn’t have a simple return value. Upon calling the function with inplace=False, you’ll get a “nice” object that is convenient to handle. If you call a function sc.tl.function in a pipeline with inplace=True but later want this nice object, you’d call sc.ex.function.
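
The pattern described here could be sketched roughly as follows. This is a toy illustration, not Scanpy's implementation: compute_neighbors and extract_neighbors are hypothetical stand-ins for sc.pp.neighbors and the proposed sc.ex.neighbors, the Neighbors class is not Scanpy's sc.Neighbors, and a plain dict plays the role of AnnData.

```python
from dataclasses import dataclass


@dataclass
class Neighbors:
    """Hypothetical 'nice' result object; not Scanpy's actual sc.Neighbors."""
    n_neighbors: int
    indices: list


def compute_neighbors(container, n_neighbors=2, inplace=True):
    """Toy stand-in for an sc.pp.neighbors-like function on 1-D points."""
    points = container["X"]
    indices = []
    for i, p in enumerate(points):
        # Rank all other points by distance to point i.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: abs(points[j] - p),
        )
        indices.append(others[:n_neighbors])
    if inplace:
        # Only storage-friendly values (dicts, lists, ints) go into the container.
        container["uns"]["neighbors"] = {
            "n_neighbors": n_neighbors,
            "indices": indices,
        }
        return None
    return Neighbors(n_neighbors, indices)


def extract_neighbors(container):
    """Toy stand-in for the proposed sc.ex.neighbors."""
    stored = container["uns"]["neighbors"]
    return Neighbors(stored["n_neighbors"], stored["indices"])


adata_like = {"X": [0.0, 1.0, 5.0, 6.5], "uns": {}}
compute_neighbors(adata_like, n_neighbors=1)  # pipeline style, in place
from_store = extract_neighbors(adata_like)    # later: get the nice object
direct = compute_neighbors(adata_like, n_neighbors=1, inplace=False)
```

Both routes end in the same object, so a user can decide late whether the plain stored values or the convenient wrapper is what they need.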

I think DataFrames (a case like tl.marker_gene_overlap) should definitely be handled within AnnData, and no extract function is necessary. But the differential expression result is a prime example of such a case. I think a function rank_genes_groups that returns a RankGenesGroups object, which then has a .to_df() function (e.g. the function rank_genes_groups from https://github.com/theislab/scanpy/pull/619), could immediately go into that namespace. Maybe we can even borrow a diffxpy object for that. The good thing is, we can keep the current rec arrays as they are very efficient and basic data types, which will work with HDF5 and zarr and xarray and everything else that might come in the future. And: Fidel wrote a ton of plotting functions around them already, which we don’t want to simply rewrite… We don’t have to, as users won’t see the rec arrays anymore…
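
As a rough idea of what such a wrapper might look like, here is a stdlib-only sketch. The class body is an assumption, not code from the linked pull request; the method to_rows is a stand-in for the .to_df() mentioned above, and the nested dicts only imitate the per-group columns of the stored rec arrays with made-up values.

```python
class RankGenesGroups:
    """Hypothetical wrapper; the name follows the comment, the code is illustrative."""

    def __init__(self, stored):
        # Keep the stored layout untouched; the wrapper only reads from it.
        self._stored = stored

    @property
    def groups(self):
        return list(self._stored["names"])

    def to_rows(self, group=None):
        """Return tidy rows, one dict per (group, gene) pair.

        Stand-in for the .to_df() idea, without a pandas dependency.
        """
        selected = self.groups if group is None else [group]
        fields = [k for k in self._stored if k != "names"]
        rows = []
        for g in selected:
            for rank, gene in enumerate(self._stored["names"][g]):
                row = {"group": g, "gene": gene, "rank": rank}
                for f in fields:
                    row[f] = self._stored[f][g][rank]
                rows.append(row)
        return rows


# Made-up values in a field -> group -> ranked list layout.
stored = {
    "names": {"0": ["GeneA", "GeneB"], "1": ["GeneC", "GeneA"]},
    "logfoldchanges": {"0": [2.1, 1.3], "1": [3.0, 0.4]},
    "pvals_adj": {"0": [1e-8, 1e-3], "1": [1e-6, 0.04]},
}
result = RankGenesGroups(stored)
rows = result.to_rows()
```

If pandas is available, pandas.DataFrame(rows) would turn these rows into a tidy data frame, while the stored values stay in their compact layout for the plotting functions that already rely on them.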

Other possible names for the API would be sc.cast or sc.object (sc.ob), less conflicting with sc.external. I think sc.ob makes sense as it really makes clear that Scanpy’s main API is for writing convenient scripts for compute-heavy stuff in a functional way. If one wants to transition to more light-weight “post-analysis”, one can transition to objects that are designed for specific tasks.

PS: I’d love to move away from the name rank_genes_groups at some point, and simply have something like difftest or DiffTest… I always thought that we might have differential expression tests for longitudinal data at some point (like Monocle), otherwise the function would be rank_genes, but I don’t think this is gonna happen soon, and if it does, it will be in the external API… A minimal difftest API should, though, continue to be in the core of Scanpy, with, at its heart, a scalable Wilcoxon rank (much more scalable than scipy’s or diffxpy’s), the t test and the scikit-learn logreg approach. diffxpy with its TensorFlow dependency can then handle very complex cases…

flying-sheep commented, Jun 20, 2019

OK, we have those alternatives:

Alternative	Pro	Con
Keep everything as it is	People will have the best understanding of its structure and not treat it as a black box	Unwieldy
Subclass AnnData in scanpy and add accessor methods/attrs	Nice API	Everyone would start using Scanpy’s AnnData subclass instead of the generic container that I think is a great design choice for extensibility; hides AnnData structure
`sc.get`	Nice API; separates Scanpy-specific API from AnnData API	Hides AnnData structure

I think sc.get is the best option here!
