ANALYZE features speeds up computation by up to 1000x?


Hi. I’ve been using gffutils with the gencode v18 human genome annotation (http://www.gencodegenes.org/releases/current.html) and noticed the following. If I create the database this way:

#!/usr/bin/python2
import gffutils as gff
import sys
db = gff.create_db('gencode.v19.annotation.gff3',
                   'gencode.v19.annotation.gff3.db',
                   force=True,
                   merge_strategy="merge")

and then run multiple calls to db.parents:

#!/usr/bin/python2
import gffutils as gff
import os.path

# Open the database created by the previous script.
db = gff.FeatureDB('gencode.v19.annotation.gff3.db')
with open(os.path.join(os.path.dirname(__file__), '1000_transcripts.txt')) as transcripts:
    for t in transcripts:
        t = t.strip()
        # Look up the parent gene(s) of each transcript ID.
        for gene in db.parents(t, featuretype='gene'):
            print(gene)

1000_transcripts.txt contains 1000 randomly chosen human transcript IDs. Timing the second script with the bash time command gives:

real    0m31.850s
user    0m17.351s
sys     0m14.497s

But if I create the database this way:

#!/usr/bin/python2
import gffutils as gff
import sys
db = gff.create_db('gencode.v19.annotation.gff3',
                   'gencode.v19.annotation.gff3.db',
                   force=True,
                   merge_strategy="merge")
db.execute('ANALYZE features')

then the same calls to db.parents take:

real    0m0.089s
user    0m0.072s
sys     0m0.016s

However, I can’t see this effect if the calls to db.parents are made in the same script that creates the database. I suggest #55 as a fix.
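For context, SQLite's ANALYZE stores the statistics it gathers in the sqlite_stat1 table inside the database file, so a script that opens the file later can use them without re-running the command. The following is a minimal sketch using only the standard-library sqlite3 module. The toy `features` table, the `featuretype_idx` index and the `stat_rows` helper are assumptions made for illustration and are not the gffutils schema or API.

```python
import os
import sqlite3
import tempfile


def stat_rows(path):
    """Return the rows of sqlite_stat1, or [] if the table does not exist."""
    conn = sqlite3.connect(path)
    try:
        return conn.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()
    except sqlite3.OperationalError:
        # Raised when sqlite_stat1 has not been created yet.
        return []
    finally:
        conn.close()


# Hypothetical toy database; the real gffutils tables are more involved.
path = os.path.join(tempfile.mkdtemp(), "toy.db")

conn = sqlite3.connect(path)
conn.execute("CREATE TABLE features (id TEXT PRIMARY KEY, featuretype TEXT)")
conn.execute("CREATE INDEX featuretype_idx ON features (featuretype)")
conn.executemany(
    "INSERT INTO features VALUES (?, ?)",
    [("f%d" % i, "gene" if i % 10 == 0 else "transcript") for i in range(1000)],
)
conn.commit()
conn.close()

# No statistics exist until ANALYZE has been run.
before = stat_rows(path)

# Same statement as in the issue, run on a fresh connection.
conn = sqlite3.connect(path)
conn.execute("ANALYZE features")
conn.commit()
conn.close()

# A later connection sees the stored statistics.
after = stat_rows(path)
print(len(before), len(after))
```

Each row of sqlite_stat1 names a table, one of its indexes, and a short statistics string that the query planner consults when choosing between indexes.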

Issue Analytics

  • State: closed
  • Created 8 years ago
  • Reactions: 3
  • Comments:6 (6 by maintainers)

Top GitHub Comments

1 reaction
daler commented, Aug 15, 2016

@mdshw5 @roryk my reading of https://www.sqlite.org/lang_analyze.html and https://www.sqlite.org/fileformat2.html#stat1tab is that the ANALYZE command gives approximations for optimizing queries and doesn’t seem to need updating unless a lot of things change. I think it would not be time-efficient to trigger an ANALYZE after every update to the db (analogous to __setitem__).

Bulk updates are typically done with the FeatureDB.update() method, and index creation and ANALYZE are run at the end of that (in the _finalize() call of the creator object). ANALYZE can now be manually triggered using FeatureDB.analyze(). If you come up with other cases where it should be run, please let me know (prob via a separate issue).

Also when loading an existing db, if the sqlite_stat1 table doesn’t exist then you get a UserWarning with a suggestion to run db.analyze() to speed things up.
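The check described above can be sketched with the standard-library sqlite3 module alone. This is an illustration under stated assumptions, not the gffutils implementation: the helper names `is_analyzed` and `open_with_check` and the warning text are hypothetical.

```python
import sqlite3
import warnings


def is_analyzed(conn):
    """Return True if the sqlite_stat1 table exists in this database."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master "
        "WHERE type = 'table' AND name = 'sqlite_stat1'"
    ).fetchone()
    return row is not None


def open_with_check(path):
    """Open a database and warn if ANALYZE has apparently never been run."""
    conn = sqlite3.connect(path)
    if not is_analyzed(conn):
        # Hypothetical message; the wording gffutils uses may differ.
        warnings.warn(
            "No sqlite_stat1 table found; running ANALYZE may speed up queries.",
            UserWarning,
        )
    return conn
```

Calling `open_with_check(':memory:')` warns because a fresh in-memory database has no statistics. Once a table with an index has been populated and ANALYZE has been run on that connection, `is_analyzed` returns True.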

xref 2c1cbc8cc

0 reactions
daler commented, Jul 27, 2016

#55 added the magic to db creation, but you’re right, the interface should check to see if the db has been analyzed yet and be smart about when to re-analyze. I’ll block out some time later this week to work on it.
