Autoencoder-based sequence embedding
Description of feature
IMO autoencoder-based sequence embedding has huge potential for finding similar immune receptors, potentially improving both speed and accuracy compared to alignment-based metrics. In particular, finding similar sequences is important for two scirpy functions:
- defining clonotypes
- querying immune receptor databases.
For the database query, an online-update algorithm similar to scArches for gene expression would be nice: the autoencoder could be trained once on the database (which might have millions of unique receptors). A new dataset (which might only have 10k-100k unique receptors) could then be projected into the same latent space as the database, significantly improving query time.
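A minimal sketch of that idea, assuming PyTorch and scikit-learn are available; `database_cdr3` and `query_cdr3` are hypothetical lists of CDR3 amino-acid strings, and `CDR3AE` with its hyperparameters is made up for illustration. A small dense autoencoder is trained once on one-hot-encoded, padded CDR3s from the reference database; a new dataset is only passed through the trained encoder, and similar receptors are retrieved by nearest-neighbour search in the latent space instead of pairwise alignment.

```python
import numpy as np
import torch
import torch.nn as nn

AA = "ACDEFGHIKLMNPQRSTVWY"
AA_IDX = {a: i for i, a in enumerate(AA)}
MAX_LEN = 30  # pad/truncate CDR3s to a fixed length

def one_hot(seqs):
    """One-hot encode and zero-pad CDR3 amino-acid strings, then flatten."""
    x = np.zeros((len(seqs), MAX_LEN, len(AA)), dtype=np.float32)
    for n, s in enumerate(seqs):
        for i, a in enumerate(s[:MAX_LEN]):
            if a in AA_IDX:
                x[n, i, AA_IDX[a]] = 1.0
    return x.reshape(len(seqs), -1)

class CDR3AE(nn.Module):
    """Small dense autoencoder over flattened one-hot CDR3s."""
    def __init__(self, dim_in=MAX_LEN * len(AA), dim_latent=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim_in, 256), nn.ReLU(), nn.Linear(256, dim_latent))
        self.decoder = nn.Sequential(nn.Linear(dim_latent, 256), nn.ReLU(), nn.Linear(256, dim_in))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

def train(model, x, epochs=20, lr=1e-3):
    """Full-batch reconstruction training; fine for a toy-sized example."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    x = torch.from_numpy(x)
    for _ in range(epochs):
        opt.zero_grad()
        recon, _ = model(x)
        loss = loss_fn(recon, x)
        loss.backward()
        opt.step()
    return model

# Train once on the (large) reference database (hypothetical `database_cdr3`):
# db_x = one_hot(database_cdr3)
# model = train(CDR3AE(), db_x)
# with torch.no_grad():
#     db_z = model.encoder(torch.from_numpy(db_x)).numpy()
#
# ... then project a new dataset into the same latent space and query it:
# from sklearn.neighbors import NearestNeighbors
# query_z = model.encoder(torch.from_numpy(one_hot(query_cdr3))).detach().numpy()
# nn_index = NearestNeighbors(n_neighbors=10).fit(db_z)
# dist, idx = nn_index.kneighbors(query_z)  # candidate similar receptors
```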
An extension to this idea is to embed gene expression and TCR/BCR data into the same latent space.
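As a rough sketch of how the two modalities could share one latent space (assuming PyTorch; the dimensions, the averaging of the two encodings, and the `JointAE` name are all illustrative, loosely in the spirit of mvTCR rather than its actual architecture):

```python
import torch
import torch.nn as nn

class JointAE(nn.Module):
    """Two encoders map gene expression and a (one-hot) receptor sequence
    into one shared latent code; two decoders reconstruct each modality."""
    def __init__(self, dim_gex, dim_seq, dim_latent=32):
        super().__init__()
        self.enc_gex = nn.Sequential(nn.Linear(dim_gex, 128), nn.ReLU(), nn.Linear(128, dim_latent))
        self.enc_seq = nn.Sequential(nn.Linear(dim_seq, 128), nn.ReLU(), nn.Linear(128, dim_latent))
        self.dec_gex = nn.Sequential(nn.Linear(dim_latent, 128), nn.ReLU(), nn.Linear(128, dim_gex))
        self.dec_seq = nn.Sequential(nn.Linear(dim_latent, 128), nn.ReLU(), nn.Linear(128, dim_seq))

    def forward(self, x_gex, x_seq):
        # average the two modality encodings into one shared latent code
        z = 0.5 * (self.enc_gex(x_gex) + self.enc_seq(x_seq))
        return self.dec_gex(z), self.dec_seq(z), z
```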
Existing tools
- Trex by @ncborcherding. Based on `keras`.
- mvTCR by @b-schubert’s lab. Combines receptor/Gex data. Based on `pytorch`.
- TESSA. Combines receptor/Gex data. I'm not even sure it's an autoencoder and still need to check in detail, but it seems to use some clever sequence embeddings.
- There are likely more…
@drEast mentioned a few months ago that he is working on something like that. Would you be willing to share a few details, and would you be interested in integrating it with scirpy?
@adamgayoso, any chance there's AirrVI soon? 😜
Re future directions:
Hello everyone, and thanks for this great exchange! 😃
For the HLA types, it would be great to somehow keep track of their sequence similarity. We could also consider their level of expression, at least broadly assigned to the HLA-A, HLA-B, and HLA-C genes, although allele-specific expression would be hard to derive from 10x data. But @b-schubert, you are definitely the expert here 😃
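For the gene-level (not allele-level) expression part, a rough sketch of what could be derived from 10x data, assuming an AnnData object `adata` with raw counts and gene symbols as `var_names`:

```python
import numpy as np
import scipy.sparse as sp

# `adata` is assumed to be an existing AnnData object with gene symbols as var_names
hla_genes = ["HLA-A", "HLA-B", "HLA-C"]
present = [g for g in hla_genes if g in adata.var_names]
counts = adata[:, present].X
counts = counts.toarray() if sp.issparse(counts) else np.asarray(counts)
# per-cell totals, broadly assigned per HLA class I gene
for gene, col in zip(present, counts.T):
    adata.obs[f"{gene}_counts"] = col
```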