Conventions
You’re all invited because of your expertise writing single-cell pipelines. I think if we could all agree on a file format and some conventions, we could make our respective code much more interoperable. At a minimum, this would require each library to support reading and writing a common file format. I propose we go with Loom. 😃
But it would be extremely wasteful to write the whole dataset to file if you only added an attribute (e.g. a cluster ID). Better instead if each library just reads and writes the necessary attributes. Wouldn't it be great if we could easily do this (and vice versa in R):
import loompy
from rpy2 import robjects

seurat = robjects.r['seurat']
pagoda = robjects.r['pagoda']
with loompy.connect("filename.loom") as ds:
    ds.ca.Cluster = seurat.cluster(ds)  # write cluster IDs back as a column attribute
(Of course, the code is made-up, but you get the idea)
To get there, we would need to agree on some conventions:
- Orientation of the dataset: rows versus columns
- Conventions for primary keys (maybe)
- Conventions for names of commonly used attributes
To get started, here’s a proposal based on our current conventions, slightly cleaned up:
Rows are genes, columns are cells.

Column attributes (per cell):
- Cluster (the cluster ID; int label from 0 to n_clusters)
- CellID (string; unique; not really used, but assumed to be present)
- Z (embedding, e.g. tSNE or PCA; float; optional)
- Valid (0 or 1, indicating valid cells)

Row attributes (per gene):
- Gene (gene symbol; string; human-readable, not unique)
- Accession (string; computer-readable, unique)
- Selected (e.g. high-variance genes; 1 or 0)
- Valid (passed some minimal criteria for being valid; 1 or 0)
Graphs on cells:
- KNN (the knn graph)
- MKNN (the mutual KNN graph)

Graphs on genes
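To make the proposal concrete, here is a hypothetical helper (not part of loompy; the attribute sets below simply mirror the lists above, and the split into required vs. optional is my own reading of the proposal) that warns when a file's attributes deviate from these conventions:

```python
# Hypothetical convention checker; attribute names mirror the proposal above.
REQUIRED_COL_ATTRS = {"CellID"}                    # assumed to be present
OPTIONAL_COL_ATTRS = {"Cluster", "Z", "Valid"}
REQUIRED_ROW_ATTRS = {"Gene", "Accession"}
OPTIONAL_ROW_ATTRS = {"Selected", "Valid"}

def check_conventions(col_attrs, row_attrs):
    """Return human-readable warnings for missing required attributes."""
    warnings = []
    for name in sorted(REQUIRED_COL_ATTRS - set(col_attrs)):
        warnings.append("missing column attribute: " + name)
    for name in sorted(REQUIRED_ROW_ATTRS - set(row_attrs)):
        warnings.append("missing row attribute: " + name)
    return warnings
```

Each library could run such a check on load and decide for itself whether a missing optional attribute matters.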
We also have conventions for layers, used e.g. for gene enrichment. For example, an aggregate Loom file would have one column per cluster instead of per cell, and would have layers such as
trinarization, which give various metrics per gene and cluster. Not sure if it makes sense to standardize those, but please let us know what your pipelines might need along these lines.
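The aggregation itself is straightforward; here is a minimal numpy sketch (the function name is my own, not a loompy API) of collapsing a genes × cells matrix into one column per cluster:

```python
import numpy as np

def aggregate_by_cluster(matrix, cluster_ids):
    """Collapse a genes x cells matrix into a genes x clusters matrix
    by taking the mean expression over each cluster's cells."""
    clusters = np.unique(cluster_ids)  # sorted unique cluster IDs
    return np.column_stack(
        [matrix[:, cluster_ids == c].mean(axis=1) for c in clusters]
    )

# Two genes, three cells; cells 0 and 1 belong to cluster 0, cell 2 to cluster 1
counts = np.array([[1.0, 3.0, 5.0],
                   [2.0, 4.0, 6.0]])
labels = np.array([0, 0, 1])
means = aggregate_by_cluster(counts, labels)  # shape (2, 2)
```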
Note: I’m opening the discussion here, but these are really intended to be conventions for analysis libraries and visualization tools, not Loom the file format. Loom itself is intended to be more general, so we will not impose on it e.g. that there will always be a
Gene attribute. We may want to use loom for totally different omics datasets in the future.
Top GitHub Comments
Picking up the thread again, I’m happy to announce Loom 2.0, an almost complete rewrite of loompy. Go to the release notes for the full list of changes. In short, v2.0 implements many of the features requested above (such as Unicode support, multidimensional attributes, and multidimensional global attributes).
It also supports a powerful new concept of in-memory views (essentially a slice through the file, including all layers, attributes, and graphs), which works great with the new scan() method for out-of-memory algorithms.
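To illustrate the idea behind scan() (this is a plain numpy sketch of the batching pattern, not loompy's implementation or signature):

```python
import numpy as np

def scan_columns(matrix, batch_size):
    """Yield (start, batch) pairs of column slices, mimicking the idea
    behind scan(): only one batch of cells is ever held in memory."""
    for start in range(0, matrix.shape[1], batch_size):
        yield start, matrix[:, start:start + batch_size]

# Out-of-memory-style reduction: per-gene totals accumulated over batches
m = np.arange(12).reshape(3, 4)
totals = np.zeros(3)
for start, batch in scan_columns(m, batch_size=2):
    totals += batch.sum(axis=1)
```

With a file-backed matrix, each batch would be read from disk on demand, so the peak memory use depends only on the batch size, not on the dataset size.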
It is much more Pythonic, with a uniform API for reading/writing/deleting attributes, graphs and layers.
It is more generous (allowing almost any kind of list, tuple or array to be assigned to attributes, or any kind of sparse or dense adjacency matrix for graphs). At the same time, it normalizes everything to conform to the file spec.
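The normalization step can be pictured roughly like this (a sketch of the idea only, not loompy's actual code; the exact on-disk encodings are defined by the file spec):

```python
import numpy as np

def normalize_attr(values):
    """Coerce a list/tuple/scalar/array into a 1-D numpy array, encoding
    unicode strings as UTF-8 bytes for HDF5 storage (illustrative only)."""
    arr = np.asarray(values)
    if arr.ndim == 0:
        arr = arr.reshape(1)           # promote scalars to length-1 arrays
    if arr.dtype.kind == "U":
        arr = np.array([s.encode("utf-8") for s in arr])
    return arr
```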
Finally, I have updated the file format specification to be more specific about the allowed datatypes and shapes.
All this, and it remains fully backwards compatible with your old loom files. I have used 2.0 for a couple of months already and it’s a significant step forward. Your code will be simpler, more expressive and more capable.
Note: your current code should mostly work without problems. You’ll get log messages for deprecated methods and attributes, but they can be safely ignored. There are two breaking changes, both mentioned in the release notes. After fixing instances of these two issues, our current analysis pipeline runs without error and uses nearly every feature of loompy 2.0, so I would consider this a reasonably stable release. Out-of-memory manifold learning and clustering of a 500k cells by 27k genes dataset with all the bells and whistles runs just fine on my MacBook with 16 GB RAM (admittedly, it takes a few hours)!
I’ve updated the docs, which are now (for technical reasons) hosted at http://linnarssonlab.org/loompy.
Thank you for this, Sten! 😃 Just before Christmas, Scanpy/AnnData incorporated .loom support: https://scanpy.readthedocs.io (docs read, docs write, code read, code write). Unfortunately, this still targets the old loom version. We will work on making use of the new version's possibilities, which should let us lose less information when exporting to
Something that might interest everyone working on storing sparse matrices in HDF5: I've worked on a more basic way of interfacing sparse matrices in HDF5 files, which works a bit differently than in loom but provides some advantages and additional functionality.
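For context, the usual trick for sparse matrices in HDF5 (whatever the exact dataset names) is to store the CSR triplet — data, indices, indptr — as three separate datasets, so that individual rows can be reconstructed without reading the whole matrix. A minimal numpy sketch of that layout:

```python
import numpy as np

dense = np.array([[0, 2, 0],
                  [1, 0, 3]])

# Build the CSR triplet, i.e. what would be stored as three HDF5 datasets
data, indices, indptr = [], [], [0]
for row in dense:
    nz = np.flatnonzero(row)           # column indices of the nonzeros
    indices.extend(nz.tolist())
    data.extend(row[nz].tolist())
    indptr.append(len(indices))        # row i occupies data[indptr[i]:indptr[i+1]]

def get_row(i, ncols=dense.shape[1]):
    """Reconstruct dense row i from the triplet, touching only its slice."""
    out = np.zeros(ncols, dtype=dense.dtype)
    out[indices[indptr[i]:indptr[i + 1]]] = data[indptr[i]:indptr[i + 1]]
    return out
```

Reading one row only requires the `indptr[i]:indptr[i+1]` slices of the data and indices datasets, which is what makes HDF5-backed sparse access cheap.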
We mentioned before that, from the very beginning, we planned to move AnnData from a static to a dynamically backed object. In the past weeks we did. AnnData now offers both: the convenience of fancy indexing and everything pandas/numpy offer for accessing and manipulating data ("memory mode"), versus full hdf5 backing ("backed mode"), with reduced speed and fewer ways of accessing and manipulating data. See my thoughts on this here. Given that loom didn't satisfy all that we need for backing AnnData objects, we currently base this on a very similar hdf5 format (but with additional support for categorical data types, arbitrary unstructured annotations, and hdf5 sparse datasets). We will be happy to move away from this once loom can fully back AnnData objects.
PS: A revised version of the Scanpy draft will appear in Genome Biology soon, and we reference and acknowledge this discussion and Sten's comments there. It really helped us rethink AnnData's hdf5 backing. PPS: I haven't been able to post this here or anywhere else earlier because my wife and I just had twins… 😄 So I will be slow in responding. Maybe @flying-sheep can help out here and there regarding anndata and Scanpy…