scverse datastructure for AIRR data

Now that scirpy is part of scverse, we could think of an improved data structure for scAIRR data. See also the discussion at https://github.com/theislab/scanpy/issues/1387.

The challenge with scAIRR data is twofold:

One cell can have n chains. Up to four of them are biologically meaningful, but there could be more for technical reasons.
Each chain has a lot of fields. See the AIRR rearrangement standard.

The current pragmatic solution is to store all fields in adata.obs.

All columns from the AIRR rearrangement schema are repeated four times.
Excess chains are serialized into JSON and stored in an extra column. These chains are not used by scirpy, but enable lossless conversions.
The downside is that there can easily be 100+ columns in adata.obs. Also, serializing excess chains is not really elegant.
The advantage is that it works really well with scanpy, i.e. any AIRR variable can immediately be used for grouping, plotting etc.
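To make the wide layout concrete, here is a minimal sketch in plain pandas. It is not scirpy's actual code: the column names (`chain_1_locus`, `extra_chains`), the ranking by `duplicate_count`, and the use of two slots instead of four are assumptions chosen to keep the example short.

```python
import json

import pandas as pd

# Hypothetical per-chain records for two cells. Field names follow the AIRR
# rearrangement schema; the values are made up for illustration.
chains = [
    {"cell_id": "c1", "locus": "TRA", "junction_aa": "CAVSDF", "duplicate_count": 9},
    {"cell_id": "c1", "locus": "TRB", "junction_aa": "CASSLG", "duplicate_count": 12},
    {"cell_id": "c2", "locus": "TRB", "junction_aa": "CASSPT", "duplicate_count": 7},
    {"cell_id": "c2", "locus": "TRB", "junction_aa": "CASSQE", "duplicate_count": 4},
    {"cell_id": "c2", "locus": "TRB", "junction_aa": "CASRRR", "duplicate_count": 1},
]

N_SLOTS = 2  # assumption: two slots keep the toy table small; the issue describes four
FIELDS = ["locus", "junction_aa", "duplicate_count"]


def to_wide(chains):
    """Build one row per cell, repeating every field once per chain slot."""
    by_cell = {}
    for chain in chains:
        by_cell.setdefault(chain["cell_id"], []).append(chain)

    rows = {}
    for cell_id, cell_chains in by_cell.items():
        # Assumed ranking: most abundant chains fill the slots first.
        ranked = sorted(cell_chains, key=lambda c: c["duplicate_count"], reverse=True)
        row = {}
        for slot in range(N_SLOTS):
            for field in FIELDS:
                value = ranked[slot][field] if slot < len(ranked) else None
                row[f"chain_{slot + 1}_{field}"] = value
        # Chains that do not fit into a slot are kept as a JSON string.
        row["extra_chains"] = json.dumps(ranked[N_SLOTS:])
        rows[cell_id] = row
    return pd.DataFrame.from_dict(rows, orient="index")


obs = to_wide(chains)
print(obs.columns.tolist())
```

Even with three fields and two slots this already yields seven columns, which shows how the full schema with four slots quickly exceeds 100.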

New options are

mudata. AIRR data could be saved as a separate modality. Even if we keep the current representation as a wide data frame, it would at least not clutter the rest of adata.obs.
awkward array support. Allows storing an arbitrary number of values per row. See https://github.com/theislab/anndata/pull/647
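The idea behind "an arbitrary number of values per row" can be sketched without the awkward library itself. The example below uses plain NumPy and the flat-content-plus-offsets layout that ragged array libraries typically build on; it is an illustration of the concept, not the awkward or anndata API, and the helper name `chains_of` is hypothetical.

```python
import numpy as np

# Junction sequences per cell (made-up values): cell 0 has two chains,
# cell 1 has none, cell 2 has three.
junctions_per_cell = [
    ["CAVSDF", "CASSLG"],
    [],
    ["CASSPT", "CASSQE", "CASRRR"],
]

# Ragged layout: one flat array with all values, plus offsets marking
# where each cell's slice starts and ends.
content = np.array([j for cell in junctions_per_cell for j in cell])
counts = np.array([len(cell) for cell in junctions_per_cell])
offsets = np.concatenate([[0], np.cumsum(counts)])


def chains_of(cell_index):
    """Return all chains of one cell as a slice of the flat array."""
    return content[offsets[cell_index]:offsets[cell_index + 1]]


print(offsets.tolist())
print(chains_of(2).tolist())
```

No slots are padded and no chains are pushed into a JSON column: a cell with zero chains and a cell with three chains use the same storage scheme.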

The new representation should also aim at being a community standard for the scverse ecosystem and should build upon the AIRR rearrangement standard. Ideally, we could get additional stakeholders on board, including conga, dandelion, tcrdist3 and possibly members of the AIRR community.

What’s the state of the AIRR single-cell schema? And what are its advantages over the rearrangement schema?

Issue Analytics

State:
Created 2 years ago
Reactions: 4
Comments: 41 (5 by maintainers)

Top GitHub Comments

2 reactions

javh commented, Jul 2, 2022

For what it’s worth, on the R side we’ve been using MultiAssayExperiments with a “rearrangement” experiment that includes the AIRR Rearrangement data for storing multimodal single-cell data that includes AIRR data, GEX, CITE-seq, and/or whatever people dream up. The AIRR assay data (equivalent to .X) is stored as a BumpyMatrix, specifically a BumpyDataFrameMatrix. We have a primary key (row names) derived from the locus field to easily separate out IGH/TRB, but I don’t think that’s necessary. Sample/cell level metadata goes into the colData (.obs) as is typical.

If I’m understanding the awkward array correctly, and I may not be, this would be the same as using the “record” array implementation to populate .X with an awkward array in a mudata object. A pandas DataFrame with multi-indexing seems like the most natural fit for working with cellular Rearrangement data (e.g., key on something like ['cell_id', 'locus', 'sequence_id']), and it looks like the conversion from awkward array to multi-indexed DataFrame is trivial. Unfortunately, my python is a bit rusty these days, so I could be misunderstanding.
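As a rough illustration of the multi-indexed layout mentioned above, the following pandas sketch keys made-up rearrangement records on ['cell_id', 'locus', 'sequence_id']. The data and variable names are hypothetical, and the conversion from an awkward array is not shown.

```python
import pandas as pd

# Made-up rearrangement records, one row per chain.
records = pd.DataFrame(
    [
        {"cell_id": "c1", "locus": "TRA", "sequence_id": "s1", "junction_aa": "CAVSDF"},
        {"cell_id": "c1", "locus": "TRB", "sequence_id": "s2", "junction_aa": "CASSLG"},
        {"cell_id": "c2", "locus": "TRB", "sequence_id": "s3", "junction_aa": "CASSPT"},
        {"cell_id": "c2", "locus": "TRB", "sequence_id": "s4", "junction_aa": "CASSQE"},
        {"cell_id": "c2", "locus": "TRB", "sequence_id": "s5", "junction_aa": "CASRRR"},
    ]
)

# Key the table on cell, locus and sequence; sorting makes lookups efficient.
rearrangements = records.set_index(["cell_id", "locus", "sequence_id"]).sort_index()

# Select all chains of one locus across cells.
trb = rearrangements.xs("TRB", level="locus")

# Count chains per cell, however many there are.
chains_per_cell = rearrangements.groupby(level="cell_id").size()

print(trb)
print(chains_per_cell)
```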

PS: “We” is not the AIRR Standards WG in this case. I don’t think we should have an official opinion on implementation.

1 reaction

grst commented, Oct 11, 2022

I decided to go with the adata.obsm variant described in https://github.com/scverse/scirpy/issues/327#issuecomment-1238067096, storing all chains in a single dimension rather than making an additional dimension for loci. This makes IO agnostic of loci, which aren’t standardized or even a mandatory field in rearrangement data. Calling “primary” and “secondary” chains would then happen in a separate step, which would also allow implementing different strategies for chain ranking in the future.
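The separation between storage and chain calling could look roughly like the sketch below. It is not the scirpy implementation: the function `call_chains`, its parameters, and the ranking by `duplicate_count` are assumptions made for illustration. Chains are stored as one flat list per cell, and the calling step only returns indices into that list.

```python
from collections import defaultdict

# All chains of one cell in a single flat list, in no particular order
# (made-up values; field names follow the AIRR rearrangement schema).
cell_chains = [
    {"locus": "TRB", "junction_aa": "CASSQE", "duplicate_count": 4, "productive": True},
    {"locus": "TRB", "junction_aa": "CASSPT", "duplicate_count": 7, "productive": True},
    {"locus": "TRB", "junction_aa": "CASRRR", "duplicate_count": 1, "productive": True},
    {"locus": "TRA", "junction_aa": "CAVSDF", "duplicate_count": 3, "productive": True},
    {"locus": "TRA", "junction_aa": "CAVS*F", "duplicate_count": 9, "productive": False},
]


def call_chains(chains, rank_key=lambda c: c["duplicate_count"], n_per_locus=2):
    """Return, per locus, the indices of the top-ranked productive chains.

    The stored list is left untouched, so a different ranking strategy can be
    applied later by passing another rank_key.
    """
    by_locus = defaultdict(list)
    for index, chain in enumerate(chains):
        if chain["productive"]:
            by_locus[chain["locus"]].append(index)
    return {
        locus: sorted(indices, key=lambda i: rank_key(chains[i]), reverse=True)[:n_per_locus]
        for locus, indices in by_locus.items()
    }


called = call_chains(cell_chains)
print(called)
```

Within each locus, the first index would be the "primary" chain and the second the "secondary" one; excess chains stay in the list and are simply not referenced.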

If anyone has reservations against this approach, now would be a good time to speak up, otherwise it might be too late.
