Support for reference sequences
For some applications it would be useful to know the reference sequence that a tree sequence coordinate space refers to. For example, with real data, we should (at a minimum) record the reference build (e.g., GRCh38) and the contig ID (e.g., chr22) associated with a tree sequence. Ideally, we would also like to be able to do things like:
for site in ts.sites():
    print(site.position, site.ref_allele)
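To make the intent of the snippet above concrete, here is a minimal sketch of looking up the reference base at each site position from a plain sequence string. It does not use tskit: the function name `ref_alleles`, the toy sequence, and the assumption that sites sit at 0-based integer positions are all illustrative, not part of any existing API.

```python
from typing import Iterator, List, Tuple


def ref_alleles(reference: str, positions: List[int]) -> Iterator[Tuple[int, str]]:
    """Yield (position, reference base) pairs.

    Hypothetical helper: assumes 0-based integer positions that index
    directly into the reference string.
    """
    for pos in positions:
        if not 0 <= pos < len(reference):
            raise ValueError(f"position {pos} is outside the reference sequence")
        yield pos, reference[pos]


# Toy data standing in for a real contig and a tree sequence's site positions.
reference = "ACGTTGCA"
site_positions = [0, 3, 7]

for position, ref_allele in ref_alleles(reference, site_positions):
    print(position, ref_allele)
```

A real implementation would also need to decide how to treat non-integer site positions, which this sketch ignores.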
As well as situations in which a well-known canonical reference is available, we may also have a one-off reference sequence that we wish to record, e.g., in simulations.
To support this, I suggest adding a reference section to the file store, with some fields. Roughly, these might look like:
reference/build -- e.g. GRCh38
reference/contig -- e.g. chr22
reference/id -- md5 hash of the sequence, cf. refget
reference/sequence -- Actual sequence information
The reference section would itself be optional (keeping backward compatibility), and some fields would probably be optional within this section (for example, reference/sequence would definitely be optional, as there is no point in storing GRCh38 chr1 over and over again). We can imagine having some mechanisms for automatically retrieving sequences using refget, but this isn’t at all necessary for a basic implementation.
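As a rough illustration of the optional fields described above, the following sketch models the section as a dataclass and flattens it into `reference/...` keys, leaving out anything that is unset. The class `Reference` and the helpers `to_store` and `sequence_md5` are hypothetical names; the uppercasing before hashing is an assumption about how a refget-style checksum would be normalised, and should be checked against the refget specification.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Reference:
    """Hypothetical container mirroring the proposed reference/* fields."""

    build: Optional[str] = None     # e.g. "GRCh38"
    contig: Optional[str] = None    # e.g. "chr22"
    id: Optional[str] = None        # md5 hash of the sequence
    sequence: Optional[str] = None  # actual sequence; usually omitted

    def to_store(self) -> Dict[str, str]:
        """Return only the fields that are set, keyed as reference/<name>."""
        fields = {
            "build": self.build,
            "contig": self.contig,
            "id": self.id,
            "sequence": self.sequence,
        }
        return {f"reference/{k}": v for k, v in fields.items() if v is not None}


def sequence_md5(sequence: str) -> str:
    """md5 hex digest of the uppercased sequence (normalisation is an assumption)."""
    return hashlib.md5(sequence.upper().encode("ascii")).hexdigest()


# Well-known reference: record build, contig and checksum, but not the bases.
known = Reference(build="GRCh38", contig="chr22", id=sequence_md5("ACGT"))
print(sorted(known.to_store()))

# One-off simulated reference: store the sequence itself.
simulated = Reference(id=sequence_md5("ACGTTGCA"), sequence="ACGTTGCA")
print(sorted(simulated.to_store()))
```

Omitting unset keys entirely, rather than writing empty values, is one way to keep files without a reference section identical to existing ones.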
At a high level, we should try to follow any upstream standards as closely as possible, e.g. GA4GH refget and any others that are relevant.
Any thoughts @tskit-dev/all, @bhaller?
I think this is a great idea, and lean towards calling it just reference (although reference_sequence would also be fine). I don’t think it should be called ancestral_sequence, because I’d argue these shouldn’t be the same thing, necessarily. Here’s some reasons:
ancestral_state column of the Site table is supposed to match the positions in the reference sequence, which is annoying and redundant.
I don’t know how we’re going to manage it from the C perspective @petrelharp, and I’m afraid I just don’t have time to work on it right now. I don’t want to rush in changes to the C API without thinking through the consequences very carefully (particularly not with something this fundamental).
I think the best thing to do from a SLiM perspective is to reopen the kastore in append mode after tskit has written it out, and write the reference sequence/data key in. Should be ~10 lines of code.