Support for reference sequences

For some applications it would be useful to know the reference sequence that a tree sequence coordinate space refers to. For example, with real data, we should (at a minimum) record the reference build (e.g., GRCh38) and the contig ID (e.g., chr22) associated with a tree sequence. Ideally, we would also like to be able to do things like:

for site in ts.sites():
    print(site.position, site.ref_allele)
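
As a rough illustration of what such an accessor could return, here is a minimal sketch in plain Python that does not use tskit at all. It assumes that sites sit at integer, zero-based positions and that the reference is held as an in-memory string; `ref_allele_at` and the toy data are hypothetical names, not part of any existing API.

```python
def ref_allele_at(reference: str, position: float) -> str:
    """Return the reference base at a zero-based site position."""
    index = int(position)
    if index != position:
        raise ValueError(f"position {position} is not an integer coordinate")
    if not 0 <= index < len(reference):
        raise IndexError(f"position {position} is outside the reference")
    return reference[index]


# Toy data standing in for a reference sequence and the site positions
# that a loop over ts.sites() would yield (both are made up).
reference = "ACGTTGCA"
site_positions = [0.0, 3.0, 7.0]

for position in site_positions:
    print(position, ref_allele_at(reference, position))
```

Looking the allele up on demand, rather than copying it into each site, keeps the reference as the single source of truth.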

As well as situations in which a well-known canonical reference is available, we may also have a one-off reference sequence that we wish to record, e.g., in simulations.

To support this, I suggest adding a reference section to the file store, with some fields. Roughly, these might look like:

reference/build     -- e.g. GRCh38
reference/contig    -- e.g. chr22
reference/id        -- MD5 hash of the sequence, cf. refget
reference/sequence  -- actual sequence information

The reference section would itself be optional (keeping backward compatibility), and some fields would probably be optional within this section. For example, reference/sequence would definitely be optional, since there is no point in storing GRCh38 chr1 over and over again. We can imagine having some mechanism for automatically retrieving sequences using refget, but this isn’t at all necessary for a basic implementation.
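
To make the optional fields concrete, here is a minimal sketch of such a record in plain Python. The `ReferenceInfo` class and its method names are hypothetical, and the checksum shown (MD5 of the upper-cased sequence) is only an assumption about how a refget-style identifier might be derived, so check the refget specification before relying on it.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReferenceInfo:
    """Hypothetical in-memory view of the proposed reference section."""

    build: Optional[str] = None      # e.g. "GRCh38"
    contig: Optional[str] = None     # e.g. "chr22"
    id: Optional[str] = None         # hex MD5 digest of the sequence
    sequence: Optional[str] = None   # omitted for well-known references

    def computed_id(self) -> Optional[str]:
        """MD5 of the upper-cased sequence, or None if no sequence is stored."""
        if self.sequence is None:
            return None
        return hashlib.md5(self.sequence.upper().encode("ascii")).hexdigest()

    def is_consistent(self) -> bool:
        """True unless both id and sequence are present and disagree."""
        if self.id is None or self.sequence is None:
            return True
        return self.id == self.computed_id()


# A well-known reference: only the build and contig are recorded.
well_known = ReferenceInfo(build="GRCh38", contig="chr22")

# A one-off reference, e.g. from a simulation: the sequence itself is stored.
one_off = ReferenceInfo(sequence="ACGTTGCA")
one_off.id = one_off.computed_id()
```

Leaving every field optional lets files without any reference information load unchanged, which is the backward-compatibility point made above.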

At a high level, we should try to follow any upstream standards as closely as possible, e.g. GA4GH refget and any others that are relevant.

Any thoughts @tskit-dev/all, @bhaller?

Top GitHub Comments

petrelharp commented, Mar 15, 2019 (4 reactions)

I think this is a great idea, and lean towards calling it just reference (although reference_sequence would also be fine). I don’t think it should be called ancestral_sequence, because I’d argue these shouldn’t necessarily be the same thing. Here are some reasons:

If we say they’re supposed to be the same, then the ancestral_state column of the Site table is supposed to match the positions in the reference sequence, which is annoying and redundant.
In practice, the reference sequence (e.g., GRCh38) is usually not the ancestral sequence, but it is still useful.
The reference sequence could still be used by e.g. SLiM to store, essentially, the ancestral sequence.

jeromekelleher commented, Mar 19, 2019 (2 reactions)

I don’t know how we’re going to manage it from the C perspective @petrelharp, and I’m afraid I just don’t have time to work on it right now. I don’t want to rush in changes to the C API without thinking through the consequences very carefully (particularly not with something this fundamental).

I think the best thing to do from a SLiM perspective is to reopen the kastore in append mode after tskit has written it out, and write the reference sequence/data key in. Should be ~10 lines of code.
