Doesn't work with Ensembl GTF files?

The program works fine with GENCODE GTF files but not with Ensembl ones. How do I get it to work?

Steps to reproduce

Error

Traceback (most recent call last):
  File "SOFTWARE/gtex-pipeline/gene_model/collapse_annotation.py", line 249, in <module>
    annotation = Annotation(args.transcript_gtf)
  File "SOFTWARE/gtex-pipeline/gene_model/collapse_annotation.py", line 68, in __init__
    g = Gene(gene_id, attributes['gene_name'], attributes['gene_type'], chrom, strand, start_pos, end_pos)
KeyError: 'gene_type'
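
A likely cause, offered as an assumption rather than a confirmed diagnosis of this script: GENCODE GTFs store the biotype under `gene_type`, whereas Ensembl GTFs use `gene_biotype`, so a lookup of `attributes['gene_type']` fails on Ensembl input. Below is a minimal sketch of a tolerant lookup. The helper name `get_gene_type` and the sample dictionaries are hypothetical and are not part of `collapse_annotation.py`.

```python
def get_gene_type(attributes, default="unknown"):
    """Return the gene biotype from a parsed GTF attribute dict.

    Tries the GENCODE key first, then the Ensembl key, then a default.
    """
    for key in ("gene_type", "gene_biotype"):
        if key in attributes:
            return attributes[key]
    return default


# Illustrative attribute dicts (made-up IDs, not taken from a real file)
gencode_style = {"gene_id": "GENE_A", "gene_type": "protein_coding"}
ensembl_style = {"gene_id": "GENE_A", "gene_biotype": "protein_coding"}

print(get_gene_type(gencode_style))
print(get_gene_type(ensembl_style))
```

Both calls return the same biotype string, so downstream code would not need to care which source the GTF came from.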

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 6

Top GitHub Comments

francois-a commented, Feb 25, 2020 (1 reaction)

Hi,

Thanks for the reproducible examples. The latest commit, 81bf691f2ae63a12d3c4c3d2da3943111729301a, should fix this.

adomingues commented, Feb 25, 2020 (0 reactions)

@francois-a sadly I have to report that there are several Ensembl GTFs for which the script is still not working. I tested several species and it works for:

  • human
  • mouse
  • Caenorhabditis elegans (v99)
  • Danio rerio (v99)

However the following species generate errors:

  • Drosophila melanogaster
  • Saccharomyces cerevisiae
  • Macaca mulatta

The error tends to be KeyError: 'gene_name' and, as far as I can tell from looking at some of the entries, it is due to genes / entries with missing gene_names. For instance, see grep "ENSMMUG00000064799" Macaca_mulatta.Mmul_10.99.gtf.
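
To check how widespread the problem is in a given file, something along these lines could list the gene records that lack an expected attribute. This is only a sketch: `parse_attributes` and `find_missing` are hypothetical helpers, the sample records are made up, and the parser assumes every attribute value is double-quoted.

```python
import re

# Matches attribute pairs such as: gene_id "GENE_A";
ATTR_RE = re.compile(r'(\w+)\s+"([^"]*)"')


def parse_attributes(field):
    """Parse the 9th GTF column into a dict (quoted values only)."""
    return dict(ATTR_RE.findall(field))


def find_missing(lines, required=("gene_name",), feature="gene"):
    """Return (gene_id, missing_keys) for records of `feature` lacking any required key."""
    missing = []
    for line in lines:
        if line.startswith("#"):  # skip header/comment lines
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature:
            continue
        attrs = parse_attributes(cols[8])
        absent = [k for k in required if k not in attrs]
        if absent:
            missing.append((attrs.get("gene_id"), absent))
    return missing


# Made-up records for illustration only
sample = [
    "#!genome-build example\n",
    '1\tensembl\tgene\t100\t900\t.\t+\t.\tgene_id "GENE_A"; gene_name "abc1"; gene_biotype "protein_coding";\n',
    '1\tensembl\tgene\t2000\t2900\t.\t-\t.\tgene_id "GENE_B"; gene_biotype "protein_coding";\n',
]
print(find_missing(sample))
```

On a real file the same function can be fed an open file handle, e.g. `with open("Macaca_mulatta.Mmul_10.99.gtf") as fh: find_missing(fh)`.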

I suppose a simple workaround would be to add the missing entry in the form gene_name ""; and report the gene_ids of the troublesome genes to stderr or stdout. Just an idea. Btw, when I say missing gene_name, it could be any non-essential entry, such as transcript_name.
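
A rough sketch of that idea, assuming the attributes have already been parsed into a dict as in the traceback above. The function name `fill_missing` and its parameters are hypothetical and do not come from the script.

```python
import sys


def fill_missing(attributes, keys=("gene_name",), default="", log=sys.stderr):
    """Insert `default` for each absent key and report the affected gene_id.

    Modifies `attributes` in place and returns the list of keys that were filled.
    """
    filled = [k for k in keys if k not in attributes]
    for k in filled:
        attributes[k] = default
    if filled:
        gene_id = attributes.get("gene_id", "?")
        print(f"{gene_id}: missing {', '.join(filled)}", file=log)
    return filled


# Made-up record for illustration only
attrs = {"gene_id": "GENE_B", "gene_biotype": "protein_coding"}
fill_missing(attrs, keys=("gene_name", "transcript_name"))
print(attrs)
```

Calling it just before the `Gene(...)` or `Transcript(...)` constructors would avoid the KeyError while still leaving a trace of which records were patched. Using the gene_id as the default instead of an empty string is another option.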

Below are the reproducible examples and the generated errors:

## Saccharomyces cerevisiae
wget ftp://ftp.ensembl.org/pub/release-99/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.99.gtf.gz
gunzip Saccharomyces_cerevisiae.R64-1-1.99.gtf.gz

./collapse_annotation.py Saccharomyces_cerevisiae.R64-1-1.99.gtf collapsed.gtf

Traceback (most recent call last):
  File "./collapse_annotation.py", line 266, in <module>
    annotation = Annotation(args.transcript_gtf)
  File "./collapse_annotation.py", line 73, in __init__
    g = Gene(gene_id, attributes['gene_name'], attributes['gene_type'], chrom, strand, start_pos, end_pos)
KeyError: 'gene_name'


## Drosophila melanogaster
wget ftp://ftp.ensembl.org/pub/release-99/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.28.99.gtf.gz

gunzip Drosophila_melanogaster.BDGP6.28.99.gtf.gz

./collapse_annotation.py Drosophila_melanogaster.BDGP6.28.99.gtf collapsed.gtf

Traceback (most recent call last):
  File "./collapse_annotation.py", line 266, in <module>
    annotation = Annotation(args.transcript_gtf)
  File "./collapse_annotation.py", line 81, in __init__
    t = Transcript(attributes.pop('transcript_id'), attributes.pop('transcript_name'),
KeyError: 'transcript_name'


## Macaca mulatta
wget ftp://ftp.ensembl.org/pub/release-99/gtf/macaca_mulatta/Macaca_mulatta.Mmul_10.99.gtf.gz

gunzip Macaca_mulatta.Mmul_10.99.gtf.gz

./collapse_annotation.py Macaca_mulatta.Mmul_10.99.gtf collapsed.gtf

Traceback (most recent call last):
  File "./collapse_annotation.py", line 266, in <module>
    annotation = Annotation(args.transcript_gtf)
  File "./collapse_annotation.py", line 73, in __init__
    g = Gene(gene_id, attributes['gene_name'], attributes['gene_type'], chrom, strand, start_pos, end_pos)
KeyError: 'gene_name'

Cheers.
