Optimization of database creation
Is it possible to optimize database creation? It’d be interesting to see where the bottlenecks are. Would Cythonizing certain parts of the code help with this, or is the bottleneck purely at the sqlite interface? For example, creating a database for this GFF (Ensembl mouse genes) takes ~15 minutes:
http://genes.mit.edu/yarden/Mus_musculus.NCBIM37.65.gff
$ gffutils-cli create Mus_musculus.NCBIM37.65.gff
The bottleneck is there even when dbs are created in memory, though it’s of course smaller in that case. This is related to another issue we discussed, which is: how to create derivative GFF files from existing ones, i.e. how to iterate through a GFF and write out a new version of it. For example, suppose you wanted to iterate through the above GFF and simply add an optional key/value pair to the attributes of some records (a minimal sketch follows below). This is a kind of “streaming” operation that can be done line-by-line and doesn’t necessarily need a db. The overhead of creating the db makes gffutils impractical for these kinds of simple operations on large-ish GFF files.
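For concreteness, here is a minimal streaming sketch of that kind of operation in plain Python, with no gffutils or database involved; the attribute key/value and file paths are made-up examples:

```python
# Minimal streaming sketch: append a key=value pair to the attributes
# column (column 9) of every GFF record, line by line, no database.
# The key/value used here are hypothetical examples.
import sys

def add_attribute(in_path, out_path, key="my_tag", value="1"):
    with open(in_path) as infile, open(out_path, "w") as outfile:
        for line in infile:
            line = line.rstrip("\n")
            # Pass comment/directive/blank lines through untouched
            if not line or line.startswith("#"):
                outfile.write(line + "\n")
                continue
            fields = line.split("\t")
            # GFF3 attributes are semicolon-separated key=value pairs
            fields[8] = fields[8].rstrip(";") + ";%s=%s" % (key, value)
            outfile.write("\t".join(fields) + "\n")

if __name__ == "__main__":
    add_attribute(sys.argv[1], sys.argv[2])
```

Memory use is constant regardless of file size, which is exactly why the db-creation overhead stands out for this class of operation.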
There are also more sophisticated (i.e. non-streaming) operations where a naive in-memory solution is still considerably faster because of the database creation bottleneck. For example, the naive GFF parser used in misopy.gff_utils (see load_genes_from_gff in https://github.com/yarden/MISO/blob/fastmiso/misopy/Gene.py) simply iterates through the GFF file multiple times to collect all the gene entries into a simple Gene object, with mRNAs represented as lists of exons. This kind of gene-centric in-memory solution is less expressive than gffutils (it does not handle arbitrary nesting, and basically ignores non-genes), but for simple operations like “add attribute X to all genes” or “reformat the IDs of this GFF” it’s considerably faster. It’s not that gffutils is slower once the DB exists — retrieval is excellent — but the overhead of creating the db trumps the computation time for many of these operations. A simplified sketch of the gene-centric approach follows below.
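As a rough illustration of that gene-centric approach (this is not the actual misopy code; for simplicity it makes a single pass and assumes parent features appear before their children, whereas load_genes_from_gff makes multiple passes instead):

```python
# Simplified gene-centric in-memory parse: genes -> mRNAs -> exon
# (start, end) tuples, using plain dicts. Assumes GFF3 column layout
# and that gene/mRNA lines precede their children in the file.
from collections import defaultdict

def parse_attributes(column9):
    # Column 9 is semicolon-separated key=value pairs in GFF3
    pairs = (kv.split("=", 1) for kv in column9.rstrip(";").split(";") if "=" in kv)
    return dict(pairs)

def load_genes_naive(path):
    genes = {}          # gene ID -> {"mRNAs": {mRNA ID: [(start, end), ...]}}
    mrna_to_gene = {}   # mRNA ID -> gene ID
    with open(path) as fh:
        for line in fh:
            if not line.strip() or line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            ftype, attrs = fields[2], parse_attributes(fields[8])
            if ftype == "gene":
                genes[attrs["ID"]] = {"mRNAs": defaultdict(list)}
            elif ftype == "mRNA":
                mrna_to_gene[attrs["ID"]] = attrs["Parent"]
            elif ftype == "exon":
                gene_id = mrna_to_gene.get(attrs["Parent"])
                if gene_id in genes:
                    genes[gene_id]["mRNAs"][attrs["Parent"]].append(
                        (int(fields[3]), int(fields[4])))
    return genes
```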
In summary, I’m wondering if there’s a way to bridge the gap between the fast but hacky solutions and the sophisticated gffutils solution that comes with this overhead. I think this is an important issue because many of the operations done on GFFs (at least the ones I do) don’t require hierarchical SQL queries.
Top GitHub Comments
Wait, sorry, I thought this was related to the above comments. The file you link to is a GFF, so there is no gene or transcript inference by default. It appears the issue with this file is the duplicated IDs.
Databases need a unique ID for each feature. The ID field in this file is not unique – for example, several distinct features share ID=cds1. If you use merge_strategy="merge", then gffutils assumes those lines refer to the same feature and so does a lot of work to merge the attributes in a nice way. Looking at this file, though, the CDSs should definitely be considered different features.

You’ll need to decide how you want to be able to refer to CDSs. If you don’t really have an opinion on that, you can try the merge_strategy="create_unique" argument when creating the database, which should speed things up considerably. The duplicated features will then be called cds1, cds1.1, and cds1.2. Alternatively, you can write a transform function to do arbitrary manipulation of the features before they get into the db, for example to create your own custom ID field based on the other attributes. Give merge_strategy="create_unique" a try to see if it helps the speed issue; it should still run in under 15 minutes. (A sketch of both options follows below.)
The merge_strategy="create_unique" worked well for what we want. Thank you kindly for the pointer.

Best,
Takeshi