Dotplot where sizes are proportional to p-value and the color to log2-fold change?

@fidelram, as discussed today, could we adapt pl.rank_genes_groups_dotplot so that it reads this information from .uns['rank_genes_groups']?

Maybe just a simple switch? Or have the arguments color and size each be a choice from the selection {pvals, pvals_adj, log2FC, expression, frac-genes-expressed}.
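
A minimal sketch of what such a switch could compute, not the actual plotting code. It assumes a plain dict that mimics the layout of .uns['rank_genes_groups'] (the real object holds NumPy record arrays, which allow the same result[key][group] indexing). The function name dot_encoding and its parameters size_key, color_key and n_genes are hypothetical.

```python
import math

# Hypothetical stand-in for adata.uns['rank_genes_groups']: each key maps a
# group name to per-gene values ordered by rank.
result = {
    "names": {"0": ["GeneA", "GeneB"], "1": ["GeneC", "GeneA"]},
    "pvals_adj": {"0": [1e-10, 1e-3], "1": [1e-6, 0.5]},
    "logfoldchanges": {"0": [3.2, 1.1], "1": [2.0, -0.4]},
}


def dot_encoding(result, groups, size_key="pvals_adj",
                 color_key="logfoldchanges", n_genes=2):
    """Return one (group, gene, size, color) tuple per dot."""
    dots = []
    for group in groups:
        for i in range(n_genes):
            value = float(result[size_key][group][i])
            if size_key.startswith("pvals"):
                # Smaller p-value gives a larger dot; the floor avoids log10(0).
                size = -math.log10(max(value, 1e-300))
            else:
                size = value
            color = float(result[color_key][group][i])
            gene = str(result["names"][group][i])
            dots.append((group, gene, size, color))
    return dots


dots = dot_encoding(result, groups=["0", "1"])
```

The resulting sizes and colors could then be handed to any scatter-style plotting call. Swapping size_key and color_key is the "simple switch" from the question.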

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 21 (18 by maintainers)

Top GitHub Comments

falexwolf commented, Apr 28, 2019 (2 reactions)

Sounds great!

Re tidy: Storing things internally in tidy format also seems inefficient to me… I remember a long discussion with Philipp more than 2 years ago… 😄

Re diffxpy: If you say that diffxpy has a good solution, why should we build a new one? Can’t we just use their solution?

> I think there are also two separate problems here, which are “what’s a better way to store differential expression results” and “what’s a good API for differential expression”.

Completely agreed.

> I’m interested in the sc.ex module you’re suggesting. Would you mind elaborating a bit more on that, particularly on some functions that would be there?

Re sc.extract

One of the core ideas of Scanpy (as opposed to, say, scikit-learn) was to take the burden of bookkeeping from the user as much as possible. This design messed up, in particular, the return values of rank_genes_groups. I would have loved to return a collection of dataframes, but I didn’t want to break that model. Also, the return values of pp.neighbors or pl.paga aren’t great.

There is a trade-off between having nice APIs and return values (such as dataframes) and a transparent and efficient on-disk representation in HDF5, zarr or another format. These days, I’d even consider simply pickling things, which would have saved us a lot of work. But at the time I thought that we’d need established compression facilities, concatenation possibilities, and some way to manually “look into” an on-disk object (both from R and from the command line) so that it’s maximally transparent. For that, the widely established, cross-language, but old-school and not entirely scalable HDF5 seemed the best choice. The Human Cell Atlas has meanwhile decided in favor of zarr. But that’s not a drama, because Scanpy only writes “storage-friendly” values to AnnData, that is, arrays and dicts, which both HDF5 and zarr know how to handle. If one uses xarray or dataframes, one has to think about how they get written to disk.
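
To illustrate why nested dicts of arrays count as “storage-friendly”, here is a small sketch that maps a nested dict onto slash-separated paths, mirroring how hierarchical formats address groups and datasets. The helper name flatten and the example contents of uns are assumptions for illustration only, and no actual file is written.

```python
def flatten(tree, prefix=""):
    """Map a nested dict to {path: leaf}, the way groups and datasets nest."""
    flat = {}
    for key, value in tree.items():
        path = f"{prefix}/{key}"
        if isinstance(value, dict):
            # A dict becomes a group: recurse into it.
            flat.update(flatten(value, path))
        else:
            # Anything else is treated as a leaf (a dataset or attribute).
            flat[path] = value
    return flat


# Hypothetical example content; plain lists stand in for arrays.
uns = {"rank_genes_groups": {"params": {"method": "t-test"},
                             "pvals": [0.01, 0.2]}}
paths = flatten(uns)
```

Every leaf ends up at a path that can be inspected by hand. A dataframe or xarray object, by contrast, needs an extra convention for how its index and columns are laid out.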

That being said: it’s likely that we’ll continue to choose representations for on-disk (and in-memory) storage that aren’t convenient (rec arrays, for instance), a three-dimensional xarray and dicts.

A general solution for this problem would be the mentioned sc.extract API, similar to sc.plotting (which also completely hides the complexity of the object from the user), except that it returns nice objects rather than visualizations.

The first function in that namespace should be sc.ex.neighbors, which should return an instance of sc.Neighbors (which can then disappear from the root API). Similarly, when sc.pp.neighbors is called with inplace=False, one should directly get a Neighbors object returned.

Now, we can apply this logic to every single function that doesn’t have a simple return value. Upon calling the function with inplace=False, you’ll get a “nice” object that is convenient to handle. If you call a function sc.tl.function in a pipeline with inplace=True but later on want this nice object, you’d call sc.ex.function.
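
The inplace/extract pattern described above could look roughly like this. It is a sketch under stated assumptions: Container is a hypothetical stand-in for AnnData, the Neighbors class here is a toy and not Scanpy’s, extract_neighbors plays the role of the proposed sc.ex.neighbors, and the “neighbor” computation is a placeholder rather than a real kNN search.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    """Hypothetical stand-in for AnnData: only the unstructured .uns dict."""
    uns: dict = field(default_factory=dict)


@dataclass
class Neighbors:
    """Hypothetical 'nice' result object."""
    n_neighbors: int
    indices: list


def neighbors(data, n_obs, n_neighbors=2, inplace=True):
    # Placeholder computation: each observation's "neighbors" are simply the
    # next indices, wrapping around. A real implementation would run kNN.
    indices = [[(i + k) % n_obs for k in range(1, n_neighbors + 1)]
               for i in range(n_obs)]
    if not inplace:
        return Neighbors(n_neighbors, indices)
    # inplace=True: store only storage-friendly dicts and lists.
    data.uns["neighbors"] = {"params": {"n_neighbors": n_neighbors},
                             "indices": indices}
    return None


def extract_neighbors(data):
    """Rebuild the nice object from what was stored in place."""
    stored = data.uns["neighbors"]
    return Neighbors(stored["params"]["n_neighbors"], stored["indices"])


data = Container()
neighbors(data, n_obs=4)                 # pipeline style, writes to data.uns
later = extract_neighbors(data)          # post-analysis style
direct = neighbors(Container(), n_obs=4, inplace=False)
```

Both routes yield the same object, so the stored representation can stay basic while users only ever handle the convenient one.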

I think DataFrames (a case like tl.marker_gene_overlap) should definitely be handled within AnnData and no extract function is necessary. But the differential expression result is a prime example of a case where one is. I think a function rank_genes_groups that returns a RankGenesGroups object, which then has a .to_df() method (e.g. the function rank_genes_groups from https://github.com/theislab/scanpy/pull/619), could immediately go into that namespace. Maybe we can even borrow a diffxpy object for that. The good thing is, we can keep the current rec arrays as they are very efficient and basic data types, which will work with HDF5 and zarr and xarray and everything else that might come in the future. And: Fidel wrote a ton of plotting functions around them already, which we don’t want to simply rewrite… We don’t have to, as users won’t see the rec arrays anymore…
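
A rough sketch of such a wrapper, assuming a dict that mimics the per-group layout of the stored result. The class below is hypothetical and not the one from the linked pull request. To stay dependency-free it exposes to_rows() instead of .to_df(); passing the returned list of dicts to pandas.DataFrame would give the tidy frame.

```python
class RankGenesGroups:
    """Hypothetical wrapper around a rank_genes_groups-style result."""

    def __init__(self, result, groups):
        self._result = result      # kept as-is, e.g. rec arrays or dicts
        self.groups = list(groups)

    def to_rows(self, keys=("names", "scores", "pvals_adj")):
        """Return one dict per (group, rank), i.e. long/tidy format."""
        rows = []
        for group in self.groups:
            n_genes = len(self._result["names"][group])
            for rank in range(n_genes):
                row = {"group": group, "rank": rank}
                for key in keys:
                    row[key] = self._result[key][group][rank]
                rows.append(row)
        return rows


# Hypothetical example data in the same result[key][group] layout.
result = {
    "names": {"0": ["GeneA", "GeneB"], "1": ["GeneC", "GeneA"]},
    "scores": {"0": [12.5, 4.0], "1": [8.1, 0.7]},
    "pvals_adj": {"0": [1e-10, 1e-3], "1": [1e-6, 0.5]},
}
rows = RankGenesGroups(result, groups=["0", "1"]).to_rows()
```

The stored arrays are never copied or reshaped until a user asks for the tidy view, which matches the point about keeping the efficient representation underneath.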

Other possible names for the API would be sc.cast or sc.object (sc.ob), less conflicting with sc.external. I think sc.ob makes sense as it really makes clear that Scanpy’s main API is for writing convenient scripts for compute-heavy stuff in a functional way. If one wants to transition to more light-weight “post-analysis”, one can transition to objects that are designed for specific tasks.

PS: I’d love to move away from the name rank_genes_groups at some point, and simply have something like difftest or DiffTest… I always thought that we might have differential expression tests for longitudinal data at some point (like Monocle), otherwise the function would be rank_genes, but I don’t think this is gonna happen soon, and if it does, it will be in the external API… A minimal difftest API should, though, continue to be in the core of Scanpy, with, at its heart, a scalable Wilcoxon rank-sum test (much more scalable than scipy’s or diffxpy’s), the t-test and the scikit-learn logreg approach. diffxpy with its TensorFlow dependency can then handle very complex cases…

flying-sheep commented, Jun 20, 2019 (1 reaction)

OK, we have those alternatives:

Alternative: Keep everything as it is
  • Pro: People will have the best understanding of its structure and not treat it as a black box
  • Con: Unwieldy

Alternative: Subclass AnnData in scanpy and add accessor methods/attrs
  • Pro: Nice API
  • Con: Everyone would start using Scanpy’s AnnData subclass instead of the generic container that I think is a great design choice for extensibility
  • Con: Hides AnnData structure

Alternative: sc.get
  • Pro: Nice API
  • Pro: Separates Scanpy-specific API from AnnData API
  • Con: Hides AnnData structure

I think sc.get is the best option here!
