Support for `..Name`, `..Key`, and `..SearchKey` global attributes
This is a proposal to generalise the default schema we currently use internally.
Paging @slinnarsson, @gioelelm and @pl-ki for feedback. If everything is OK I can implement this, and Peter will be able to use it in his updates to the pipeline.
Support for global attributes `..Name`, `..Key`, and `..SearchKey`
Currently, we use `Accession` as the “row key attribute”, `Gene` as the default “row search key” attribute, and `CellID` as both “column key” attribute and “column search key” attribute. None of this is written down in the documentation.
It also ties the structure of our loom files to the current pipeline. Perhaps these labels will change in the future - they have already changed a few times internally in the past!
Suggestion: add global attributes that state which of the row/column attributes are the key attributes. That way, writing scripts for the pipeline will be more “future proof”, since it won’t have to worry about changing labels.
Similarly, we assume rows represent genes and columns represent cells, but people might use loom for different purposes. I suggest adding a name attribute for that.
For backwards compatibility, we can use the above defaults as fall-backs if these attributes are missing.
The proposed schema ends up as:
- `rowName` (optional; used to label the scatterplot view in `loom-viewer`. Defaults to “Genes” if `Gene` is present as a row attribute, “Rows” otherwise)
- `rowKey` (optional; must contain a unique value for each row. Defaults to `Accession` if it is present in row attributes, uses row index numbers otherwise)
- `rowSearchKey` (optional; defaults to `Gene` if it is present as a row attribute, uses row index numbers otherwise)
- `colName` (optional; used to label the scatterplot view in `loom-viewer`. Defaults to “Cells” if `CellID` is present in column attributes, “Columns” otherwise)
- `colKey` (optional; must contain a unique value for each column. Defaults to `CellID` if it is present in column attributes, uses column index numbers otherwise)
- `colSearchKey` (optional; defaults to `CellID` if it is present in column attributes, uses column index numbers otherwise)
Using `rowSearchKey` and `colSearchKey` for row/column getter functions
Currently, the documentation suggests the following way to access a row/column by Gene or CellID:
>>> ds[np.logical_or(ds.Gene == "Actb", ds.Gene == "Gapdh"),:]
array([[ 2., 9., 9., ..., 0., 14., 0.],
[ 0., 1., 4., ..., 0., 14., 3.]], dtype=float32)
>>> ds[:, ds.CellID == "AAACATACATTCTC-1"]
array([[ 0.],
[ 0.],
[ 0.],
...,
[ 0.],
[ 0.],
[ 0.]], dtype=float32)
I think it would be convenient to create a helper function that would wrap this NumPy logic:
>>> ds.getRows(["Actb", "Gapdh"])
array([[ 2., 9., 9., ..., 0., 14., 0.],
[ 0., 1., 4., ..., 0., 14., 3.]], dtype=float32)
>>> ds.getColumns(["AAACATACATTCTC-1"])
array([[ 0.],
[ 0.],
[ 0.],
...,
[ 0.],
[ 0.],
[ 0.]], dtype=float32)
These would use the `rowSearchKey` and `colSearchKey` attributes for choosing a default attribute, with the option for users to override this: `ds.getRows(list, searchKey=self.rowSearchKey)`
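As a rough sketch of what such helpers might do internally, the functions below wrap the `np.logical_or` idiom from the documentation. They are hypothetical free functions operating on a plain NumPy matrix and a dict of attribute arrays, not methods of the real `LoomConnection`, and the default `search_key` values simply mirror the current labels.

```python
import numpy as np


def get_rows(matrix, row_attrs, values, search_key="Gene"):
    """Select rows whose `search_key` attribute equals any of `values`.

    Hypothetical stand-in for the proposed ds.getRows: `matrix` is a 2-D
    array and `row_attrs` a dict of 1-D arrays with one entry per row.
    """
    labels = np.asarray(row_attrs[search_key])
    mask = np.zeros(labels.shape, dtype=bool)
    for value in values:
        # Same effect as chaining np.logical_or by hand.
        mask |= labels == value
    return matrix[mask, :]


def get_columns(matrix, col_attrs, values, search_key="CellID"):
    """Column counterpart of get_rows."""
    labels = np.asarray(col_attrs[search_key])
    mask = np.zeros(labels.shape, dtype=bool)
    for value in values:
        mask |= labels == value
    return matrix[:, mask]


# Toy data standing in for a loom file (4 genes x 3 cells).
matrix = np.arange(12, dtype=np.float32).reshape(4, 3)
row_attrs = {"Gene": np.array(["Actb", "Gapdh", "Sox2", "Nes"])}
col_attrs = {"CellID": np.array(["c1", "c2", "c3"])}

rows = get_rows(matrix, row_attrs, ["Actb", "Gapdh"])
cols = get_columns(matrix, col_attrs, ["c2"])
```

Note that a boolean mask returns rows in file order, not in the order the values were requested; whether that is the desired behaviour would need to be decided.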
TODO list if this is all approved:
- add `rowName`, `rowKey`, `rowSearchKey`, `colName`, `colKey` and `colSearchKey` support to loom creation methods
  - `create`
  - `create_from_cellranger`
  - `_create_sparse`
- add `getRows` method to `LoomConnection`
- add `getColumns` method to `LoomConnection`
Issue Analytics
- Created 6 years ago
- Comments:5 (5 by maintainers)
Top GitHub Comments
For uniformity with the Python ecosystem, and in particular with `pandas`, the widely used package for tabular data, I would suggest against calling the special indexing attributes `getRows` and `getCols`; I would suggest `.loc` and `.iloc` instead.
I don’t think this is a good idea, for several reasons:
- You’re creating a lot of new syntax that only works with special attributes. All other attributes will still have to be accessed using numpy-style fancy indexing. This will be very confusing.
- You’re borrowing syntax from pandas. This will create the expectation that loompy works like pandas. Will `loc` and `iloc` support all five kinds of argument that pandas allows? That’s a lot of new code to support and debug.
- By creating two different ways of achieving the same effect, one of which only works in special cases, you’re making loompy harder to learn. Every code example that uses `loc`/`iloc` will not transfer to any other attribute except the specially designated ones.
- If the concern is specifically to support getting rows (or columns) given a set of values, numpy already has a fine syntax for that, based on `isin`. This has the virtue of working the same for all attributes. Plus, `isin` is just one of many set operations in numpy, so it is also much more powerful.
- If the concern is to make it easy for pandas users to learn loompy, we should instead make a tutorial that teaches the equivalent loompy idioms.
- All the new syntax will be broken on loom files that do not have the `rowSearchKey` attribute (currently, 100% of all loom files). Before using `loc`/`iloc` you’ll have to check for (or set) `rowSearchKey` (every time!). That’s a lot of boilerplate code. People won’t do it and their code will break unexpectedly.
- This also makes me reconsider the `rowKeys` (etc.) attributes. I definitely don’t think those should be loompy standards, because we will then always be tempted to rely on them. They might be loom-viewer standards though.
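The `isin` idiom mentioned above can be illustrated on plain NumPy arrays. This is only a sketch with toy data standing in for a loom file (the arrays and their values are made up), but the same fancy indexing applies to any attribute, not just a designated search key.

```python
import numpy as np

# Toy stand-ins for the main matrix and two row attributes (not a real loom file).
data = np.arange(12, dtype=np.float32).reshape(4, 3)
gene = np.array(["Actb", "Gapdh", "Sox2", "Nes"])
accession = np.array(["A1", "A2", "A3", "A4"])

# Select rows by any attribute with the same syntax.
by_gene = data[np.isin(gene, ["Actb", "Gapdh"]), :]
by_accession = data[np.isin(accession, ["A3"]), :]

# invert=True selects everything except the listed values.
others = data[np.isin(gene, ["Actb", "Gapdh"], invert=True), :]
```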