ANALYZE features speeds up computation by up to 1000x?
Hi. I've been using gffutils with the GENCODE v18 human genome annotation (http://www.gencodegenes.org/releases/current.html) and noticed the following. If I create the database this way:
#!/usr/bin/python2
import gffutils as gff
import sys
db = gff.create_db('gencode.v19.annotation.gff3',
                   'gencode.v19.annotation.gff3.db',
                   force=True,
                   merge_strategy="merge")
and then make multiple calls to db.parents:
#!/usr/bin/python2
import gffutils as gff
import os.path
import sys
db = gff.FeatureDB('gencode.v19.annotation.gff3.db')
with open(os.path.join(os.path.dirname(__file__), '1000_transcripts.txt')) as transcripts:
    for t in transcripts:
        t = t.strip()
        for gene in db.parents(t, featuretype='gene'):
            print gene
1000_transcripts.txt contains 1000 random transcripts from the human genome.
The run time of the second script, as reported by the bash time command, is
real 0m31.850s
user 0m17.351s
sys 0m14.497s
But if I create the database this way, with an extra ANALYZE at the end:
#!/usr/bin/python2
import gffutils as gff
import sys
db = gff.create_db('gencode.v19.annotation.gff3',
                   'gencode.v19.annotation.gff3.db',
                   force=True,
                   merge_strategy="merge")
db.execute('ANALYZE features')
then the same calls to db.parents take
real 0m0.089s
user 0m0.072s
sys 0m0.016s
However, I can't see this effect if the calls to db.parents are made in the same script that creates the database. I suggest #55 as a fix.
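For a database that was already built without statistics, a minimal sketch of adding them afterwards with Python's standard sqlite3 module might look like the following. It assumes the gffutils database is an ordinary SQLite file containing a table named features (as the ANALYZE features statement above implies). The helper name analyze_features is hypothetical and not part of gffutils, and the sketch is written for Python 3 rather than the python2 used in the scripts above.

```python
import sqlite3


def analyze_features(db_path, table="features"):
    """Run ANALYZE on one table of an existing SQLite file.

    Hypothetical helper, not part of gffutils. Returns the rows that
    ANALYZE wrote to sqlite_stat1 for that table.
    """
    conn = sqlite3.connect(db_path)
    try:
        # ANALYZE gathers statistics about the table's indexes and stores
        # them in sqlite_stat1, where the query planner can use them.
        conn.execute('ANALYZE "{}"'.format(table))
        conn.commit()
        # Each row is (table name, index name, statistics string).
        return conn.execute(
            "SELECT tbl, idx, stat FROM sqlite_stat1 WHERE tbl = ?",
            (table,),
        ).fetchall()
    finally:
        conn.close()
```

Calling analyze_features('gencode.v19.annotation.gff3.db') once should be enough for later scripts that open the same file, since the statistics are stored in the database itself.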
Top GitHub Comments
@mdshw5 @roryk my reading of https://www.sqlite.org/lang_analyze.html and https://www.sqlite.org/fileformat2.html#stat1tab is that the ANALYZE command gives approximations for optimizing queries and doesn't seem to need updating unless a lot of things change. I think it would not be time-efficient to trigger an ANALYZE after every update to the db (analogous to __setitem__).
Bulk updates are typically done with the FeatureDB.update() method, and index creation and ANALYZE are run at the end of that (in the _finalize() call of the creator object). ANALYZE can now be manually triggered using FeatureDB.analyze(). If you come up with other cases where it should be run, please let me know (prob via a separate issue).
Also, when loading an existing db, if the sqlite_stat1 table doesn't exist then you get a UserWarning with a suggestion to run db.analyze() to speed things up.
xref 2c1cbc8cc
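The check described above (warn when sqlite_stat1 is missing) can be sketched with the standard library alone. This is an illustration of the idea, not the gffutils implementation: the function names has_stats and warn_if_not_analyzed and the warning text are assumptions made for this example.

```python
import sqlite3
import warnings


def has_stats(conn):
    """Return True if the sqlite_stat1 table exists in this database."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master "
        "WHERE type = 'table' AND name = 'sqlite_stat1'"
    ).fetchone()
    return row is not None


def warn_if_not_analyzed(conn):
    """Emit a UserWarning if no planner statistics are present.

    Hypothetical helper. Returns True when a warning was issued.
    """
    if not has_stats(conn):
        warnings.warn(
            "no sqlite_stat1 table found; running ANALYZE may speed up queries",
            UserWarning,
        )
        return True
    return False
```

Note that this only detects whether ANALYZE has ever been run. It says nothing about whether the stored statistics are still representative after later changes to the data.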
#55 added the magic to db creation, but you’re right, the interface should check to see if the db has been analyzed yet and be smart about when to re-analyze. I’ll block out some time later this week to work on it.