Doesn't work with Ensembl GTF files?
The program works fine with GENCODE GTFs but not with Ensembl ones. How do I get it to work?
Steps to reproduce
- wget ftp://ftp.ensembl.org/pub/release-97/gtf/homo_sapiens/Homo_sapiens.GRCh38.97.gtf.gz
- mkdir BUG
- pigz -dc Homo_sapiens.GRCh38.97.gtf.gz > BUG/Homo_sapiens.GRCh38.97.gtf
- python3 SOFTWARE/gtex-pipeline/gene_model/collapse_annotation.py BUG/Homo_sapiens.GRCh38.97.gtf BUG/Homo_sapiens.GRCh38.97_compressed.gtf
Error
Traceback (most recent call last):
  File "SOFTWARE/gtex-pipeline/gene_model/collapse_annotation.py", line 249, in <module>
    annotation = Annotation(args.transcript_gtf)
  File "SOFTWARE/gtex-pipeline/gene_model/collapse_annotation.py", line 68, in __init__
    g = Gene(gene_id, attributes['gene_name'], attributes['gene_type'], chrom, strand, start_pos, end_pos)
KeyError: 'gene_type'
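The KeyError means the gene records in this file have no gene_type attribute. Ensembl GTFs store the equivalent value under gene_biotype, while GENCODE uses gene_type. Below is a minimal sketch of a lookup that accepts either key. The helper names (parse_attributes, gene_type_of) and the example attribute string are hypothetical and are not taken from collapse_annotation.py.

```python
import re

# Matches `key "value";` pairs in GTF column 9. Unquoted values
# (e.g. GENCODE's `level 2;`) are deliberately ignored in this sketch.
_ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_attributes(field):
    """Return the quoted attributes of a GTF attribute column as a dict."""
    return dict(_ATTR_RE.findall(field))


def gene_type_of(attributes):
    """Prefer GENCODE-style gene_type, fall back to Ensembl-style gene_biotype."""
    return attributes.get('gene_type', attributes.get('gene_biotype', ''))


# Hypothetical Ensembl-style attribute string, for illustration only.
ensembl_attrs = parse_attributes(
    'gene_id "GENE_EXAMPLE"; gene_name "EXAMPLE1"; gene_biotype "protein_coding";'
)
print(gene_type_of(ensembl_attrs))
```

Returning an empty string when both keys are absent is one possible choice. Raising an error that names the offending gene_id might be preferable if the gene type is needed downstream.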
Issue Analytics
- State:
- Created 4 years ago
- Comments:6
Top GitHub Comments
Hi,
Thanks for the reproducible examples. The latest commit, 81bf691f2ae63a12d3c4c3d2da3943111729301a, should fix this.
@francois-a Sadly, I have to report that there are several Ensembl GTFs for which the script is still not working. I tested several species, and it works for:
However, the following species generate errors:
The error tends to be
KeyError: 'gene_name'
and, as far as I can tell from looking at some of the entries, it is due to genes/entries with a missing gene_name. For instance, see
grep "ENSMMUG00000064799" Macaca_mulatta.Mmul_10.99.gtf
I suppose a simple workaround would be to add the missing entry in the form
gene_name "";
and report the gene_ids of the troublesome genes to stderr or stdout. Just an idea. By the way, when I say missing gene_name, it could be any non-essential entry, such as transcript_name.
Below are the reproducible example and the generated errors:
Cheers.
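The workaround suggested above (fill in an empty gene_name and report the affected gene_id) could look roughly like the sketch below. It assumes the attributes have already been parsed into a dict. The function name fill_missing and its parameters are hypothetical and are not part of collapse_annotation.py.

```python
import sys


def fill_missing(attributes, required=('gene_name',), default='', log=sys.stderr):
    """Return a copy of `attributes` in which every missing key from
    `required` is set to `default`; each filled key is reported to `log`."""
    filled = dict(attributes)
    for key in required:
        if key not in filled:
            filled[key] = default
            gene_id = filled.get('gene_id', '<unknown gene_id>')
            print(f"warning: {gene_id} has no {key}, using '{default}'", file=log)
    return filled


# The gene mentioned in the comment above, reduced to its gene_id.
attrs = fill_missing({'gene_id': 'ENSMMUG00000064799'})
print(attrs['gene_name'] == '')
```

Other non-essential keys such as transcript_name can be added via `required`. Using the gene_id itself as the default instead of an empty string would be an alternative that keeps names unique.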