Scala API for accessing FASTA reference sequences

Anywhere in fgbio that we need to access FASTA files for reference genomes or similar, we end up using HTSJDK’s ReferenceSequenceFile and related classes (ReferenceSequenceFileFactory, ReferenceSequence). These classes get the job done and offer a fair amount of functionality, but they are old and their design is a little clunky, especially coming from Scala.

We have previously written Scala APIs for concepts in HTSJDK for both SAM/BAM and VCF. We’d now like to do that for reference sequences. I’m 99.9% sure the right way to do this is to wrap the HTSJDK classes and try to never expose them in the Scala API (similar to the VCF wrapper, and less like the SAM/BAM wrapper).

Some things that would be nice in the design:

A more Scala-esque API in general (use of apply() where it makes sense)
A unified API that makes sense for a) accessing an indexed file, b) accessing a non-indexed file, c) pulling ranges out of a sequence already loaded into memory
The ability to perform transforms on loaded sequences (e.g. force everything to be upper case, mask all lower case bases to Ns, etc.)
Clarity in the API about when you’re accessing metadata about a sequence vs. when you’re pulling sequence data into memory
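
To make the wish list above concrete, here is a minimal sketch of what such an API could look like. Every name in it (SequenceInfo, LoadedSequence, ReferenceSource, InMemoryReference) is hypothetical and does not exist in fgbio or HTSJDK. It only implements the in-memory case, and it assumes 1-based inclusive coordinates for sub-sequence access.

```scala
// Hypothetical sketch: none of these names exist in fgbio or HTSJDK.

/** Metadata only: cheap to obtain, no bases are loaded. */
final case class SequenceInfo(name: String, index: Int, length: Int)

/** Bases held in memory, addressed with 1-based inclusive coordinates (an assumption). */
final class LoadedSequence(val info: SequenceInfo, private val bases: String) {
  require(bases.length == info.length, "length in metadata must match the bases")

  /** Returns the bases in [start, end], 1-based inclusive. */
  def apply(start: Int, end: Int): String = {
    require(start >= 1 && start <= end && end <= info.length, s"invalid range $start-$end")
    bases.substring(start - 1, end)
  }

  /** Returns a new sequence with `f` applied to every base. */
  def map(f: Char => Char): LoadedSequence = new LoadedSequence(info, bases.map(f))

  /** Forces every base to upper case. */
  def upperCased: LoadedSequence = map(_.toUpper)

  /** Masks every lower-case base to N. */
  def softMaskedToN: LoadedSequence = map(b => if (b.isLower) 'N' else b)
}

/** One trait that indexed, non-indexed and in-memory sources could all implement. */
trait ReferenceSource {
  /** Metadata access: should never pull sequence data into memory. */
  def infos: IndexedSeq[SequenceInfo]

  /** Data access: pulls the named sequence into memory. */
  def load(name: String): LoadedSequence

  def apply(name: String): LoadedSequence = load(name)
  def apply(index: Int): LoadedSequence   = load(infos(index).name)
}

/** The simplest implementation: everything is already in memory. */
final class InMemoryReference(seqs: Seq[(String, String)]) extends ReferenceSource {
  private val loaded: IndexedSeq[LoadedSequence] =
    seqs.zipWithIndex.map { case ((name, bases), idx) =>
      new LoadedSequence(SequenceInfo(name, idx, bases.length), bases)
    }.toIndexedSeq

  private val byName: Map[String, LoadedSequence] =
    loaded.map(s => s.info.name -> s).toMap

  val infos: IndexedSeq[SequenceInfo] = loaded.map(_.info)

  def load(name: String): LoadedSequence = byName(name)
}
```

With this shape, ref("chr1")(1, 4) reads like a lookup followed by a range query, and a file-backed implementation would only need to supply infos and load.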

It may also be time for us to think about having a Scala implementation/wrapper of SequenceDictionary, as that’s often used heavily in conjunction with reference sequences and is also super clunky.

FWIW I also have code kicking around somewhere for various activities that might be useful to think about and/or pull into fgbio as use cases for this, including:

Invoking HTSJDK functions to generate .fai files
“Normalizing” FASTA files to a standard line length
Building up a sequence dictionary from a FASTA and a set of per-chromosome aliases
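
As an illustration of the “normalizing” use case, the sketch below re-wraps the sequence lines of each FASTA record to a fixed line length. It is not the code mentioned above. The object name FastaNormalizer and the default of 60 bases per line are assumptions made for the example, and it works on lines already read into memory rather than on files.

```scala
// Hypothetical helper, not part of fgbio or HTSJDK.
object FastaNormalizer {
  /** Re-wraps each record's sequence to `lineLength` bases per line.
    * Header lines (starting with '>') are kept as-is; blank lines are dropped.
    */
  def normalize(lines: Iterable[String], lineLength: Int = 60): Vector[String] = {
    require(lineLength > 0, "lineLength must be positive")
    val out   = Vector.newBuilder[String]
    val bases = new StringBuilder // bases of the record currently being read

    // Emits the buffered bases in chunks of `lineLength`, then resets the buffer.
    def flush(): Unit = {
      bases.toString.grouped(lineLength).foreach(chunk => out += chunk)
      bases.clear()
    }

    lines.map(_.trim).filter(_.nonEmpty).foreach { line =>
      if (line.startsWith(">")) { flush(); out += line }
      else bases.append(line)
    }
    flush() // emit the final record
    out.result()
  }
}
```

This version buffers one whole record at a time, which keeps the code short. A production version would more likely stream bases through a fixed-size buffer so that large chromosomes are never held in memory in full.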

I would suggest starting this ticket by:

Reviewing existing implementations including HTSJDK’s, pyfaidx and any others you can think of
Reviewing how we use the existing HTSJDK implementations throughout fgbio and possibly elsewhere
Sketching out a high level design of the public API for review by myself and @nh13

@nh13 Anything to add?

Top GitHub Comments

nh13 commented, Jan 4, 2021

@mjhipp I think your examples make it clear that the conversion should be cached outside the loop. This may not be immediately obvious when moving to the new API, so I agree with your warning for folks moving over. I think the action item here is to add a big warning to the documentation of the implicit converters stating that the conversion is expensive! If it is really a problem, we could think about caching the conversion inside the converter class.

@tfenne what do you think?
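
A minimal sketch of the pitfall described in this comment, using hypothetical stand-in types rather than the real fgbio converters. JavaDict, ScalaDict and the conversions counter are invented for illustration. The point is only that an implicit conversion used inside a loop runs on every iteration unless the converted value is bound once beforehand.

```scala
import scala.language.implicitConversions

// Hypothetical stand-ins; these are NOT the fgbio/HTSJDK types or converters.
object ConversionCost {
  final class JavaDict(val names: Array[String])
  final class ScalaDict(val names: Vector[String])

  /** Counts how many times the (pretend-expensive) conversion has run. */
  var conversions: Int = 0

  /** Implicit converter: copies every entry, so its cost grows with the dictionary. */
  implicit def toScala(d: JavaDict): ScalaDict = {
    conversions += 1
    new ScalaDict(d.names.toVector)
  }

  def nameLength(dict: ScalaDict, i: Int): Int = dict.names(i).length

  /** Converts implicitly on every iteration: n conversions. */
  def slow(d: JavaDict, n: Int): Int =
    (0 until n).map(i => nameLength(d, i)).sum

  /** Converts once, outside the loop: 1 conversion. */
  def fast(d: JavaDict, n: Int): Int = {
    val converted: ScalaDict = d
    (0 until n).map(i => nameLength(converted, i)).sum
  }
}
```

Both functions return the same result. The only difference is how often toScala runs, which is the cost the proposed documentation warning would call out.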

tfenne commented, Apr 13, 2020

A few other thoughts in no particular order:

It will likely be useful to also have val chr1 = refSeq(0), which looks up by sequence index (in addition to by name)
ReferenceSequenceFile in HTSJDK, as well as our internal SAM and VCF APIs, all use 1-based inclusive coordinates. We should do likewise here, and that might impact the method names we use for sub-sequence access
Do we want a chr1.force or similar that would drag all of chr1 into memory? E.g. I might write code and know I’m going to access enough of a chromosome to make it worth loading into memory for sub-sequence access
I make frequent use of the ability to load the .dict or sequence dictionary that is, by convention, alongside the FASTA in many cases. It contains some information that is redundant with the .fai and some that is unique. We could either have an option to find and load it given a reference, or always load it if it’s there and include the information in the NewReferenceSequence object(s)
One thing in the dict that would be nice to make good use of is the list of aliases (see example below). It would be nice a) to be able to do refSeq("MT") and get back chrM and b) to do refSeq("chrM").names or similar and get back Seq("chrM", "MT", "J01415.2", "NC_012920.1")

@SQ	SN:chrM	LN:16569	M5:c68f52674c9fb33aef52dcf399755519	AS:hg38	SP:Homo sapiens	UR:http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz	AN:MT,J01415.2,NC_012920.1
@SQ	SN:chr1	LN:248956422	M5:2648ae1bacce4ec4b6cf337dcae37816	AS:hg38	SP:Homo sapiens	UR:http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz	AN:1,CM000663.2,NC_000001.11
@SQ	SN:chr2	LN:242193529	M5:4bb4f82880a14111eb7327169ffb729b	AS:hg38	SP:Homo sapiens	UR:http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz	AN:2,CM000664.2,NC_000002.12
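
The alias lookup could be sketched roughly as follows. DictEntry and AliasLookup are hypothetical names, and the parser only reads the SN, LN and AN tags from tab-separated @SQ lines like the ones above. It makes no attempt to be a full sequence dictionary parser.

```scala
// Hypothetical sketch, not an existing fgbio or HTSJDK API.

/** One @SQ record, reduced to the fields needed for alias lookup. */
final case class DictEntry(name: String, length: Int, aliases: Seq[String]) {
  /** Primary name first, then the aliases in file order. */
  def names: Seq[String] = name +: aliases
}

object DictEntry {
  /** Parses one tab-separated @SQ line; tags other than SN, LN and AN are ignored. */
  def parse(line: String): DictEntry = {
    val fields = line.split('\t')
    require(fields.headOption.contains("@SQ"), s"not an @SQ line: $line")

    // Split each field at its FIRST colon so values such as URLs stay intact.
    val tags: Map[String, String] = fields.drop(1).map { field =>
      val (key, rest) = field.span(_ != ':')
      key -> rest.drop(1)
    }.toMap

    DictEntry(
      name    = tags("SN"),
      length  = tags("LN").toInt,
      aliases = tags.get("AN").toList.flatMap(_.split(',').toList)
    )
  }
}

/** Resolves either a primary name or any alias to its entry. */
final class AliasLookup(entries: Seq[DictEntry]) {
  private val byAnyName: Map[String, DictEntry] =
    entries.flatMap(e => e.names.map(n => n -> e)).toMap

  def apply(name: String): DictEntry = byAnyName(name)
}
```

With the chrM line above, lookup("MT").name gives back chrM, and lookup("chrM").names gives the primary name followed by its three aliases. If two entries were to share an alias, this sketch silently keeps the later one; a real implementation would probably want to reject that.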
