Optimization of database creation
Is it possible to optimize database creation? It’d be interesting to see where the bottlenecks are. Would Cythonizing certain parts of the code help with this, or is the bottleneck purely at the sqlite interface? For example, creating a database for this GFF (Ensembl mouse genes) takes ~15 minutes:
http://genes.mit.edu/yarden/Mus_musculus.NCBIM37.65.gff
$ gffutils-cli create Mus_musculus.NCBIM37.65.gff
The bottleneck is there even when dbs are created in memory, though it’s of course smaller in that case. This is related to another issue we discussed, which is: how to create derivative GFF files from existing ones, i.e. how to iterate through a GFF and write out a new version of it. For example, suppose you wanted to iterate through the above GFF and simply add an optional key/value pair to the attributes of some records (a minimal sketch follows below). This is a kind of “streaming” operation that can be done line-by-line and doesn’t necessarily need a db. The overhead of creating the db makes gffutils impractical for these kinds of simple operations on large-ish GFF files.
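For concreteness, here is a minimal streaming sketch of that kind of operation in plain Python, with no gffutils or database involved; the attribute key/value and file paths are made-up examples:

```python
# Minimal streaming sketch: append a key=value pair to the attributes
# column (column 9) of every GFF record, line by line, no database.
# The key/value used here are hypothetical examples.
import sys

def add_attribute(in_path, out_path, key="my_tag", value="1"):
    with open(in_path) as infile, open(out_path, "w") as outfile:
        for line in infile:
            line = line.rstrip("\n")
            # Pass comment/directive/blank lines through untouched
            if not line or line.startswith("#"):
                outfile.write(line + "\n")
                continue
            fields = line.split("\t")
            # GFF3 attributes are semicolon-separated key=value pairs
            fields[8] = fields[8].rstrip(";") + ";%s=%s" % (key, value)
            outfile.write("\t".join(fields) + "\n")

if __name__ == "__main__":
    add_attribute(sys.argv[1], sys.argv[2])
```

Memory use is constant regardless of file size, which is exactly why the db-creation overhead stands out for this class of operation.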
There are also more sophisticated (i.e. non-streaming) operations where a naive in-memory solution is still considerably faster because of the database creation bottleneck. For example, the naive GFF parser used in misopy.gff_utils (see load_genes_from_gff in https://github.com/yarden/MISO/blob/fastmiso/misopy/Gene.py) simply iterates through the GFF file multiple times to collect all the gene entries into a simple Gene object, with mRNAs represented as lists of exons. This kind of gene-centric in-memory solution is less expressive than gffutils (it does not handle arbitrary nesting, and basically ignores non-genes), but for simple operations like “add attribute X to all genes” or “reformat the IDs of this GFF” it’s considerably faster. It’s not that gffutils is slower once the DB exists — retrieval is excellent — but the overhead of creating the db trumps the computation time for many of these operations. A simplified sketch of the gene-centric approach follows below.
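As a rough illustration of that gene-centric approach (this is not the actual misopy code; for simplicity it makes a single pass and assumes parent features appear before their children, whereas load_genes_from_gff makes multiple passes instead):

```python
# Simplified gene-centric in-memory parse: genes -> mRNAs -> exon
# (start, end) tuples, using plain dicts. Assumes GFF3 column layout
# and that gene/mRNA lines precede their children in the file.
from collections import defaultdict

def parse_attributes(column9):
    # Column 9 is semicolon-separated key=value pairs in GFF3
    pairs = (kv.split("=", 1) for kv in column9.rstrip(";").split(";") if "=" in kv)
    return dict(pairs)

def load_genes_naive(path):
    genes = {}          # gene ID -> {"mRNAs": {mRNA ID: [(start, end), ...]}}
    mrna_to_gene = {}   # mRNA ID -> gene ID
    with open(path) as fh:
        for line in fh:
            if not line.strip() or line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            ftype, attrs = fields[2], parse_attributes(fields[8])
            if ftype == "gene":
                genes[attrs["ID"]] = {"mRNAs": defaultdict(list)}
            elif ftype == "mRNA":
                mrna_to_gene[attrs["ID"]] = attrs["Parent"]
            elif ftype == "exon":
                gene_id = mrna_to_gene.get(attrs["Parent"])
                if gene_id in genes:
                    genes[gene_id]["mRNAs"][attrs["Parent"]].append(
                        (int(fields[3]), int(fields[4])))
    return genes
```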
In summary, I’m wondering if there’s a way to bridge the gap between the fast but hacky solutions and the sophisticated gffutils solution that comes with this overhead. I think this is an important issue because many of the operations done on GFFs (at least the ones I do) don’t require hierarchical SQL queries.
Top GitHub Comments
Wait, sorry, I thought this was related to the above comments. The file you link to is a GFF, so there is no gene or transcript inference by default. It appears the issue with this file is the duplicated IDs.
Databases need a unique ID for each feature. The ID field in this file is not unique – for example, several distinct features share ID=cds1. If you use merge_strategy="merge", then gffutils assumes those lines refer to the same feature and so does a lot of work to merge the attributes in a nice way. Looking at this file, though, the CDSs should definitely be considered different features.

You’ll need to decide how you want to be able to refer to CDSs. If you don’t really have an opinion on that, you can try the merge_strategy="create_unique" argument when creating the database, which should speed things up considerably. The duplicated features will then be called cds1, cds1.1, and cds1.2. Alternatively, you can write a transform function to do arbitrary manipulation of the features before they get into the db, for example to create your own custom ID field based on the other attributes. Give merge_strategy="create_unique" a try to see if it helps the speed issue; it should still run in under 15 minutes. (A sketch of both options follows below.)
The merge_strategy="create_unique" worked well for what we want. Thank you kindly for the pointer.

Best,
Takeshi