Scala API for accessing FASTA reference sequences
Anywhere in fgbio we need to access FASTA for reference genomes or similar, we end up using HTSJDK’s `ReferenceSequenceFile` and related classes (`ReferenceSequenceFileFactory`, `ReferenceSequence`). These classes get the job done and offer a fair amount of functionality, but they are old and their design is a little clunky, especially coming from Scala.
We have previously generated Scala APIs for concepts in HTSJDK for both SAM/BAM and for VCF. We’d now like to do that for reference sequences. I’m 99.9% sure the right way to do this is to wrap the HTSJDK classes and try to never expose them in the Scala API (similar to the VCF wrapper, and less like the SAM/BAM wrapper).
Some things that would be nice in the design:
- A more Scala-esque API in general (use of `apply()` where it makes sense)
- A unified API that makes sense for a) accessing an indexed file, b) accessing a non-indexed file, c) pulling ranges out of a sequence already loaded into memory
- The ability to perform transforms on loaded sequences (e.g. force everything to be upper case, mask all lower case bases to Ns, etc.)
- Clarity in the API about when you’re accessing metadata about a sequence vs. when you’re pulling sequence data into memory
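To make the wish list above a bit more concrete, here is a minimal sketch of what such an API could look like. Every name in it (`SeqInfo`, `RefSource`, `InMemoryRefSource`, `Transforms`) is hypothetical and exists in neither fgbio nor HTSJDK. It only covers the in-memory case and assumes 1-based, closed-ended coordinates; an indexed-file or streaming implementation would sit behind the same trait.

```scala
// Hypothetical sketch: none of these names exist in fgbio or HTSJDK.

/** Metadata only: cheap to obtain, never loads bases. */
final case class SeqInfo(name: String, index: Int, length: Int)

/** A source of reference sequences, whatever the backing store. */
trait RefSource {
  /** Metadata for every sequence, in file order. */
  def infos: IndexedSeq[SeqInfo]

  /** Fetches bases for a 1-based, closed-ended range. The only call that loads sequence data. */
  def fetch(name: String, start: Int, end: Int): String

  /** Scala-style metadata lookup by sequence name. */
  def apply(name: String): SeqInfo =
    infos.find(_.name == name).getOrElse(throw new NoSuchElementException(s"Unknown sequence: $name"))

  /** Returns a view of this source that applies `f` to every fetched range. */
  def map(f: String => String): RefSource = {
    val self = this
    new RefSource {
      def infos: IndexedSeq[SeqInfo] = self.infos
      def fetch(name: String, start: Int, end: Int): String = f(self.fetch(name, start, end))
    }
  }
}

/** In-memory implementation backed by (name, bases) pairs. */
final class InMemoryRefSource(seqs: Seq[(String, String)]) extends RefSource {
  private val byName: Map[String, String] = seqs.toMap

  val infos: IndexedSeq[SeqInfo] =
    seqs.zipWithIndex.map { case ((name, bases), idx) => SeqInfo(name, idx, bases.length) }.toIndexedSeq

  def fetch(name: String, start: Int, end: Int): String = {
    val bases = byName.getOrElse(name, throw new NoSuchElementException(s"Unknown sequence: $name"))
    require(start >= 1 && start <= end && end <= bases.length, s"Invalid range $name:$start-$end")
    bases.substring(start - 1, end)
  }
}

/** Example transforms from the list above. */
object Transforms {
  /** Forces all bases to upper case. */
  val upperCase: String => String = _.toUpperCase

  /** Masks every lower-case base to N. */
  val maskLowerCase: String => String = _.map(b => if (b.isLower) 'N' else b)
}
```

With this shape, `ref("chr1")` only ever returns metadata, `ref.fetch("chr1", 3, 6)` is the single place where bases are pulled into memory, and `ref.map(Transforms.upperCase)` gives back another `RefSource` rather than a different type.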
It may also be that it’s time for us to think about having a Scala implementation/wrapper of `SequenceDictionary`, as that’s often used heavily in conjunction with reference sequences, and is also super clunky.
FWIW I also have code kicking around somewhere for various activities that might be useful to think about and/or pull into fgbio as use cases for this, including:
- Invoking HTSJDK functions to generate `.fai` files
- “Normalizing” FASTA files to a standard line length
- Building up a sequence dictionary from a FASTA and set of per-chromosome aliases
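As an illustration of the “normalizing” use case, here is a rough sketch that re-wraps FASTA lines to a fixed width. It is not the code referred to above: `FastaNormalizer` and its `normalize` method are hypothetical names, the default of 60 bases per line is an assumption, and the sketch buffers one whole sequence at a time, which may not be acceptable for large chromosomes.

```scala
// Hypothetical sketch: `FastaNormalizer` is an illustrative name, not an fgbio class.
object FastaNormalizer {
  /** Re-wraps each sequence in `lines` to `lineLength` bases per line.
    * Header lines (starting with '>') pass through unchanged; blank lines are dropped.
    */
  def normalize(lines: Iterator[String], lineLength: Int = 60): Iterator[String] = {
    require(lineLength > 0, "lineLength must be positive")
    val in = lines.map(_.trim).filter(_.nonEmpty).buffered

    // Reads one header plus all of its sequence lines, then re-wraps the bases.
    def nextRecord(): Iterator[String] = {
      val header = in.next()
      require(header.startsWith(">"), s"Expected a FASTA header but found: $header")
      val bases = new StringBuilder
      while (in.hasNext && !in.head.startsWith(">")) bases.append(in.next())
      Iterator(header) ++ bases.toString.grouped(lineLength)
    }

    new Iterator[String] {
      private var pending: Iterator[String] = Iterator.empty
      def hasNext: Boolean = pending.hasNext || in.hasNext
      def next(): String = {
        if (!pending.hasNext) pending = nextRecord()
        pending.next()
      }
    }
  }
}
```

Because it works on `Iterator[String]`, the same function could be fed from a file or from a test fixture without change.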
I would suggest starting this ticket by:
- Reviewing existing implementations including HTSJDK’s, pyfaidx and any others you can think of
- Reviewing how we use the existing HTSJDK implementations through fgbio and possibly elsewhere
- Sketching out a high level design of the public API for review by myself and @nh13
@nh13 Anything to add?
Top GitHub Comments
@mjhipp I think from your examples, it’s clear that the conversion should be cached outside the loop. This may not be immediately obvious when moving to the new API, so I agree with your warning for folks moving over. I think the action item here is to add a big warning to the documentation of the implicit converters stating that the conversion is expensive! If it is really a problem, we could think about caching the conversion inside the converter class.
@tfenne what do you think?
A few other thoughts in no particular order:
- Support for `val chr1 = refSeq(0)` as well, which looks up by sequence index (in addition to by name)
- Something like `chr1.force` or similar that would drag all of chr1 into memory? E.g. I might write code and know I’m going to access enough of a chromosome to make it worth loading into memory for sub-sequence access.
- The `.dict` or sequence dictionary that is by convention alongside the FASTA in many cases. It contains some redundant info with the `.fai` and some unique information. We could either have an option to find and load it given a reference, or always load it if it’s there and include the information in the `NewReferenceSequence` object(s)
- Being able to a) do `refSeq("MT")` and get back `chrM` and b) do `refSeq("chrM").names` or similar and get back `Seq("chrM", "MT", "J01415.2", "NC_012920.1")`
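The index and alias lookups suggested above could be sketched roughly as follows. `SeqMeta` and `AliasedDict` are hypothetical names used only for illustration. The sketch returns metadata rather than bases, and it assumes that primary names and aliases are unique across the whole dictionary.

```scala
// Hypothetical sketch: `SeqMeta` and `AliasedDict` are illustrative names only.

/** Metadata for one sequence, with any alternative names it is known by. */
final case class SeqMeta(name: String, index: Int, length: Int, aliases: Seq[String] = Seq.empty) {
  /** Primary name first, then the aliases in the order given. */
  def names: Seq[String] = name +: aliases
}

/** Looks sequences up by index, by primary name, or by alias. */
final class AliasedDict(metas: IndexedSeq[SeqMeta]) {
  private val byName: Map[String, SeqMeta] = metas.flatMap(m => m.names.map(_ -> m)).toMap
  require(byName.size == metas.map(_.names.size).sum, "Sequence names and aliases must be unique")

  /** Lookup by 0-based sequence index. */
  def apply(index: Int): SeqMeta = metas(index)

  /** Lookup by primary name or alias. */
  def apply(name: String): SeqMeta =
    byName.getOrElse(name, throw new NoSuchElementException(s"Unknown sequence: $name"))

  /** Non-throwing lookup by primary name or alias. */
  def get(name: String): Option[SeqMeta] = byName.get(name)
}

object AliasedDictExample {
  val dict = new AliasedDict(IndexedSeq(
    SeqMeta("chr1", index = 0, length = 248956422),
    SeqMeta("chrM", index = 1, length = 16569, aliases = Seq("MT", "J01415.2", "NC_012920.1"))
  ))
}
```

With `AliasedDictExample.dict`, looking up `"MT"` resolves to the `chrM` entry, and `dict("chrM").names` lists the primary name followed by its aliases. Overloading `apply` on `Int` and `String` keeps both lookups looking the same at the call site.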