Support for reference sequences

For some applications it would be useful to know the reference sequence that a tree sequence coordinate space refers to. For example, with real data, we should (at a minimum) record the reference build (e.g., GRCh38) and the contig ID (e.g., chr22) associated with a tree sequence. Ideally, we would also like to be able to do things like:

for site in ts.sites():
    print(site.position, site.ref_allele)
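
As a rough illustration of what the loop above is asking for, here is a minimal sketch that looks up a reference allele by position. It assumes the reference is held as a plain Python string and that site positions are integer-valued, 0-based offsets into it; the helper name `ref_alleles` is hypothetical and not part of tskit.

```python
def ref_alleles(reference, positions):
    """Yield (position, ref_allele) pairs for each site position.

    Assumption: positions are 0-based, integer-valued coordinates into
    `reference`, which is a plain string of bases.
    """
    for position in positions:
        index = int(position)
        if index != position:
            raise ValueError(f"non-integer site position: {position}")
        if not 0 <= index < len(reference):
            raise ValueError(f"position {position} is outside the reference")
        yield position, reference[index]


# Usage: positions would come from something like
# (site.position for site in ts.sites()).
for position, allele in ref_alleles("ACGTAC", [0.0, 3.0, 5.0]):
    print(position, allele)
```

Whether non-integer positions should be rejected or rounded is a design choice that this sketch settles arbitrarily by rejecting them.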

As well as situations in which a well-known canonical reference is available, we may also have a one-off reference sequence that we wish to record, e.g., in simulations.

To support this, I suggest adding a reference section to the file store, with some fields. Roughly, these might look like:

reference/build     -- e.g. GRCh38
reference/contig    -- e.g. chr22
reference/id        -- md5 hash of the sequence, cf. refget
reference/sequence  -- Actual sequence information

The reference section would itself be optional (keeping backward compatibility), and some fields would probably be optional within this section. For example, reference/sequence would definitely be optional: there is no point in storing GRCh38 chr1 over and over again. We can imagine having some mechanism for automatically retrieving sequences using refget, but this isn't at all necessary for a basic implementation.
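
To make the optional fields concrete, here is a minimal sketch of how such a section could be modelled and flattened into key-value pairs. The `Reference` class, its `to_store` method and the `sequence_md5` helper are hypothetical names for illustration only, not part of tskit or kastore. The md5 is taken over the upper-cased sequence, which is my understanding of how refget normalises sequences before hashing; treat that as an assumption to check against the spec.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional


def sequence_md5(sequence: str) -> str:
    """Return the md5 hex digest of the upper-cased sequence (assumed refget-style)."""
    return hashlib.md5(sequence.upper().encode("ascii")).hexdigest()


@dataclass
class Reference:
    """Hypothetical in-memory view of the proposed reference section."""

    build: Optional[str] = None     # e.g. "GRCh38"
    contig: Optional[str] = None    # e.g. "chr22"
    id: Optional[str] = None        # md5 of the sequence, may be given without it
    sequence: Optional[str] = None  # the actual sequence, usually omitted

    def __post_init__(self) -> None:
        # If the sequence is stored but no id was supplied, derive the id.
        if self.sequence is not None and self.id is None:
            self.id = sequence_md5(self.sequence)

    def to_store(self) -> Dict[str, bytes]:
        """Flatten to "reference/<field>" keys, skipping fields that are unset."""
        fields = {
            "build": self.build,
            "contig": self.contig,
            "id": self.id,
            "sequence": self.sequence,
        }
        return {
            f"reference/{name}": value.encode("ascii")
            for name, value in fields.items()
            if value is not None
        }


# A well-known reference: only build and contig are recorded.
print(sorted(Reference(build="GRCh38", contig="chr22").to_store()))
# A one-off simulated reference: the sequence is stored and the id derived.
print(sorted(Reference(sequence="ACGTACGT").to_store()))
```

An empty `Reference()` flattens to no keys at all, which matches the idea that a file without a reference section remains valid.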

At a high level, we should try to follow any upstream standards as closely as possible, e.g. GA4GH refget and any others that are relevant.

Any thoughts @tskit-dev/all, @bhaller?

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Reactions: 2
  • Comments: 46 (34 by maintainers)

Top GitHub Comments

4 reactions
petrelharp commented, Mar 15, 2019

I think this is a great idea, and lean towards calling it just reference (although reference_sequence would also be fine). I don't think it should be called ancestral_sequence, because I'd argue these shouldn't necessarily be the same thing. Here are some reasons:

  • If we say they’re supposed to be the same, then the ancestral_state column of the Site table is supposed to match the positions in the reference sequence, which is annoying and redundant.
  • In practice, the reference sequence (e.g. GRCh38) is usually not the ancestral sequence, but it is still useful.
  • The reference sequence could still be used by e.g. SLiM to store, essentially, the ancestral sequence.

2 reactions
jeromekelleher commented, Mar 19, 2019

I don’t know how we’re going to manage it from the C perspective @petrelharp, and I’m afraid I just don’t have time to work on it right now. I don’t want to rush in changes to the C API without thinking through the consequences very carefully (particularly not with something this fundamental).

I think the best thing to do from a SLiM perspective is to reopen the kastore in append mode after tskit has written it out, and write the reference sequence/data key in. Should be ~10 lines of code.
