ANALYZE features speeds computation up to 1000x?

Hi. I’ve been using gffutils with the GENCODE v18 human genome annotation (http://www.gencodegenes.org/releases/current.html) and noticed the following. If I create the database this way:

#!/usr/bin/python2
import gffutils as gff
import sys
db = gff.create_db('gencode.v19.annotation.gff3',
                   'gencode.v19.annotation.gff3.db',
                   force=True,
                   merge_strategy="merge")

And then make multiple calls to db.parents:

#!/usr/bin/python2
import gffutils as gff
import os.path
import sys

db = gff.FeatureDB('gencode.v19.annotation.gff3.db')
with open(os.path.join(os.path.dirname(__file__), '1000_transcripts.txt')) as transcripts:
    for t in transcripts:
        t = t.strip()
        for gene in db.parents(t, featuretype='gene'):
            print gene

1000_transcripts.txt contains 1000 random transcript IDs from the human genome. Running the second script under the bash time command gives:

real    0m31.850s
user    0m17.351s
sys     0m14.497s

But if I create the database this way:

#!/usr/bin/python2
import gffutils as gff
import sys
db = gff.create_db('gencode.v19.annotation.gff3',
                   'gencode.v19.annotation.gff3.db',
                   force=True,
                   merge_strategy="merge")
db.execute('ANALYZE features')

Then the same calls to db.parents take

real    0m0.089s
user    0m0.072s
sys     0m0.016s

However, I can’t see this effect if the calls to db.parents are made in the same script that creates the database. I suggest #55 as a fix.
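For anyone unsure what ANALYZE actually changes, here is a minimal sketch using only the standard-library sqlite3 module. The table and index below are a simplified stand-in, not the real gffutils schema, and `stat_rows` is a hypothetical helper. It only shows that ANALYZE writes per-index statistics into the sqlite_stat1 table, which the query planner can then consult; it does not reproduce the timings above.

```python
import sqlite3


def stat_rows(conn):
    """Return the rows of sqlite_stat1, or [] if ANALYZE has never been run."""
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master "
        "WHERE type = 'table' AND name = 'sqlite_stat1'"
    ).fetchone()
    if exists is None:
        return []
    return conn.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()


# Simplified stand-in schema (an assumption, not the gffutils layout).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE features (id TEXT PRIMARY KEY, featuretype TEXT)")
conn.execute("CREATE INDEX featuretype_idx ON features (featuretype)")
conn.executemany(
    "INSERT INTO features VALUES (?, ?)",
    [("f%d" % i, "gene" if i % 10 == 0 else "exon") for i in range(1000)],
)
conn.commit()

before = stat_rows(conn)          # no statistics collected yet
conn.execute("ANALYZE features")  # same statement as in the script above
conn.commit()
after = stat_rows(conn)           # one row per index on the table

for row in after:
    print(row)
```

Because the statistics are stored in the database file itself, they are still there when the file is reopened later from another script.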

Issue Analytics

State:
Created 8 years ago
Reactions: 3
Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction

daler commented, Aug 15, 2016

@mdshw5 @roryk my reading of https://www.sqlite.org/lang_analyze.html and https://www.sqlite.org/fileformat2.html#stat1tab is that the ANALYZE command gives approximations for optimizing queries and doesn’t seem to need updating unless a lot of things change. I think it would not be time-efficient to trigger an ANALYZE after every update to the db (analogous to __setitem__).

Bulk updates are typically done with the FeatureDB.update() method, and index creation and ANALYZE are run at the end of that (in the _finalize() call of the creator object). ANALYZE can now be manually triggered using FeatureDB.analyze(). If you come up with other cases where it should be run, please let me know (prob via a separate issue).
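As a rough illustration of the "analyze once at the end of a bulk update" pattern described above, here is a sketch in plain sqlite3. It is not the gffutils implementation: `bulk_update` is a hypothetical helper, and the three-column `relations` table and its index name are assumptions chosen for the example.

```python
import sqlite3


def bulk_update(conn, rows):
    """Insert many rows in one transaction, then refresh statistics once."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO relations (parent, child, level) VALUES (?, ?, ?)",
            rows,
        )
    # A single ANALYZE after the whole batch, rather than one per row.
    conn.execute("ANALYZE")
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE relations (parent TEXT, child TEXT, level INTEGER)")
conn.execute("CREATE INDEX relations_child ON relations (child)")

bulk_update(
    conn,
    [("gene%d" % (i // 5), "tx%d" % i, 1) for i in range(500)],
)
stats = conn.execute("SELECT idx, stat FROM sqlite_stat1").fetchall()
print(stats)
```

Since the statistics are only approximations, small edits after the bulk load should not need another ANALYZE.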

Also when loading an existing db, if the sqlite_stat1 table doesn’t exist then you get a UserWarning with a suggestion to run db.analyze() to speed things up.
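A check like that can be sketched with the standard library alone. The helper names (`has_planner_stats`, `open_with_check`) and the warning text below are hypothetical, not what gffutils uses; the only thing taken from the SQLite docs is that ANALYZE stores its results in a table called sqlite_stat1.

```python
import sqlite3
import warnings


def has_planner_stats(conn):
    """Return True if the sqlite_stat1 table (written by ANALYZE) exists."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master "
        "WHERE type = 'table' AND name = 'sqlite_stat1'"
    ).fetchone()
    return row is not None


def open_with_check(path):
    """Hypothetical loader: open the db and warn if it was never analyzed."""
    conn = sqlite3.connect(path)
    if not has_planner_stats(conn):
        warnings.warn(
            "No sqlite_stat1 table found; running ANALYZE may speed up queries.",
            UserWarning,
        )
    return conn
```

With gffutils itself, the remedy suggested by the warning is simply to call db.analyze() on the loaded FeatureDB.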

xref 2c1cbc8cc

0 reactions

daler commented, Jul 27, 2016

#55 added the magic to db creation, but you’re right, the interface should check to see if the db has been analyzed yet and be smart about when to re-analyze. I’ll block out some time later this week to work on it.