scverse datastructure for AIRR data
Now that scirpy is part of scverse, we could think about an improved data structure for scAIRR data. See also the discussion at https://github.com/theislab/scanpy/issues/1387.
The challenge with scAIRR data is that
- 1 cell can have n chains. Up to four of them are biologically meaningful, but there could be more for technical reasons.
- Each chain has a lot of fields. See the AIRR rearrangement standard.
The current pragmatic solution is to store all fields in adata.obs.
- All columns from the AIRR rearrangement schema are repeated four times.
- Excess chains are serialized into JSON and stored in an extra column. These chains are not used by scirpy, but enable lossless conversions.
- The downside is that there can easily be 100+ columns in adata.obs. Also, serializing excess chains is not really elegant.
- The advantage is that it works really well with scanpy, i.e. any AIRR variable can immediately be used for grouping, plotting etc.
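To make the trade-off concrete, here is a minimal sketch of the wide layout described above, using only the standard library. The column names (chain_1_locus etc.), the three-field subset and the example values are assumptions for illustration; they are not scirpy's actual column names.

```python
import json

# Hypothetical subset of AIRR rearrangement fields; the real schema has many more.
FIELDS = ("locus", "junction_aa", "duplicate_count")
MAX_CHAINS = 4  # the four chain slots that get their own columns


def flatten_cell(chains):
    """Spread up to MAX_CHAINS chains over numbered columns; JSON-encode the rest."""
    row = {}
    for i in range(MAX_CHAINS):
        chain = chains[i] if i < len(chains) else {}
        for field in FIELDS:
            # Illustrative column naming, not scirpy's actual convention.
            row[f"chain_{i + 1}_{field}"] = chain.get(field)
    # Excess chains are kept only so that the conversion stays lossless.
    row["extra_chains"] = json.dumps(chains[MAX_CHAINS:])
    return row


# Made-up example data: one cell with two chains, one with five.
cells = {
    "cell_1": [
        {"locus": "TRA", "junction_aa": "CAVSDF", "duplicate_count": 12},
        {"locus": "TRB", "junction_aa": "CASSLG", "duplicate_count": 30},
    ],
    "cell_2": [
        {"locus": "TRA", "junction_aa": "CAVRDS", "duplicate_count": 8},
        {"locus": "TRA", "junction_aa": "CALSEA", "duplicate_count": 5},
        {"locus": "TRB", "junction_aa": "CASSPT", "duplicate_count": 21},
        {"locus": "TRB", "junction_aa": "CASSQE", "duplicate_count": 3},
        {"locus": "TRB", "junction_aa": "CASRGD", "duplicate_count": 1},
    ],
}

obs_rows = {cell_id: flatten_cell(chains) for cell_id, chains in cells.items()}
```

Even with three fields per chain, every cell already gets 13 columns, most of them empty for cells with few chains, which is how the full schema ends up at 100+ columns.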
New options are
- mudata. AIRR data could be saved as a separate modality. Even if we keep the current representation of a wide data frame, it would at least not clutter the rest of adata.obs.
- awkward array support. Allows storing an arbitrary number of values per row. See https://github.com/theislab/anndata/pull/647
The new representation should also aim at being a community standard for the scverse ecosystem and should build upon the AIRR rearrangement standard. Ideally, we could get additional stakeholders on board, including conga, dandelion, tcrdist3 and possibly members of the AIRR community.
- What’s the state of the AIRR single-cell schema? And what are its advantages over the rearrangement schema?
For what it’s worth, on the R side we’ve been using MultiAssayExperiments with a “rearrangement” experiment that includes the AIRR Rearrangement data for storing multimodal single-cell data that includes AIRR data, GEX, CITE-seq, and/or whatever people dream up. The AIRR assay data (equivalent to .X) is stored as a BumpyMatrix, specifically a BumpyDataFrameMatrix. We have a primary key (row names) derived from the locus field to easily separate out IGH/TRB, but I don’t think that’s necessary. Sample/cell level metadata goes into the colData (.obs) as is typical.
If I’m understanding the awkward array correctly, and I may not be, this would be the same as using the “record” array implementation to populate .X with an awkward array in a mudata object. A pandas DataFrame with multi-indexing seems like the most natural fit for working with cellular Rearrangement data (e.g., key on something like ['cell_id', 'locus', 'sequence_id']) and it looks like the conversion from awkward array to multi-indexed DataFrame is trivial. Unfortunately, my Python is a bit rusty these days, so I could be misunderstanding.
PS: “We” is not the AIRR Standards WG in this case. I don’t think we should have an official opinion on implementation.
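The multi-indexed DataFrame suggested in this comment could look roughly like the sketch below. It assumes pandas is available; the records and the junction_aa values are made up, and only the key ['cell_id', 'locus', 'sequence_id'] comes from the comment itself.

```python
import pandas as pd

# Made-up rearrangement records, one per chain (long format).
records = [
    {"cell_id": "cell_1", "locus": "TRA", "sequence_id": "seq_1", "junction_aa": "CAVSDF"},
    {"cell_id": "cell_1", "locus": "TRB", "sequence_id": "seq_2", "junction_aa": "CASSLG"},
    {"cell_id": "cell_2", "locus": "TRB", "sequence_id": "seq_3", "junction_aa": "CASSPT"},
]

# Key the table on cell, locus and sequence, as proposed in the comment.
chains = (
    pd.DataFrame.from_records(records)
    .set_index(["cell_id", "locus", "sequence_id"])
    .sort_index()
)

# All chains that belong to one cell.
cell_1 = chains.loc["cell_1"]

# All TRB chains across cells (the locus level is dropped from the result).
trb = chains.xs("TRB", level="locus")

# Number of chains per cell; any n is possible without adding columns.
chains_per_cell = chains.groupby(level="cell_id").size()
```

Unlike the wide layout, a cell with more than four chains simply contributes more rows here, so no JSON side channel is needed.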
I decided to go with the adata.obsm variant described in https://github.com/scverse/scirpy/issues/327#issuecomment-1238067096, storing all chains in a single dimension rather than making an additional dimension for loci. This makes IO agnostic of loci, which aren’t standardized or even a mandatory field in rearrangement data. Calling “primary” and “secondary” chains would then happen in a separate step, which would also allow implementing different strategies for chain ranking in the future.
If anyone has reservations about this approach, now would be a good time to speak up, otherwise it might be too late.
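As a rough illustration of why separating storage from chain calling helps, the sketch below keeps each cell's chains as a flat list and ranks them in an independent step. The function names and the choice of duplicate_count as the ranking key are assumptions for this example, not the strategy scirpy implements.

```python
def rank_chains(chains, key="duplicate_count"):
    """Return chain indices ordered from most to least abundant.

    Hypothetical ranking strategy: sort by a numeric field, treating missing
    values as 0. Ties keep their original storage order.
    """
    return sorted(
        range(len(chains)),
        key=lambda i: chains[i].get(key) or 0,
        reverse=True,
    )


def call_primary_secondary(chains, key="duplicate_count"):
    """Pick the two top-ranked chains; storage itself is left untouched."""
    order = rank_chains(chains, key)
    return {
        "primary": order[0] if len(order) > 0 else None,
        "secondary": order[1] if len(order) > 1 else None,
    }


# Made-up chains of a single cell, stored in one flat dimension.
cell_chains = [
    {"locus": "TRB", "duplicate_count": 5},
    {"locus": "TRB", "duplicate_count": 20},
    {"locus": "TRA", "duplicate_count": 1},
]

calls = call_primary_secondary(cell_chains)
```

Because the ranking only returns indices into the stored chains, a different strategy (for example, ranking within each locus) could be swapped in without touching IO or the stored data.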