Missing scaffold genes
I use pyensembl for gene ID/name mapping, but lately I noticed that some Ensembl gene IDs (e.g. ENSG00000285395) are missing from pyensembl.EnsemblRelease(97).genes().
I think that's because pyensembl uses the {species}.{reference}.{release}.gtf.gz GTF URL template instead of {species}.{reference}.{release}.chr_patch_hapl_scaff.gtf.gz (see https://ftp.ensembl.org/pub/release-97/gtf/homo_sapiens/ for comparison).
I understand that this file includes genes that are not mapped to chromosomes, so might be problematic in the context of pyensembl, but still, it’d be more complete to include all the genes.
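To make the difference between the two templates concrete, here is a minimal sketch that expands both for release 97. The helper gtf_url and the constant names are hypothetical and are not part of pyensembl; the species and reference values (Homo_sapiens, GRCh38) are assumptions based on how the files in the directory linked above are named.

```python
# Sketch only: these names are illustrative and not pyensembl's API.
BASE_URL = "https://ftp.ensembl.org/pub/release-{release}/gtf/{species_dir}/"

# The template pyensembl appears to use (per the issue text above).
PRIMARY_TEMPLATE = "{species}.{reference}.{release}.gtf.gz"
# The template that also covers patches, haplotypes and scaffolds.
FULL_TEMPLATE = "{species}.{reference}.{release}.chr_patch_hapl_scaff.gtf.gz"


def gtf_url(template, species="Homo_sapiens", reference="GRCh38", release=97):
    """Expand a filename template and prepend the release directory."""
    filename = template.format(
        species=species, reference=reference, release=release
    )
    # Assumption: the directory name is the lowercased species name.
    directory = BASE_URL.format(release=release, species_dir=species.lower())
    return directory + filename


if __name__ == "__main__":
    print(gtf_url(PRIMARY_TEMPLATE))
    print(gtf_url(FULL_TEMPLATE))
```

The two resulting URLs differ only in the chr_patch_hapl_scaff suffix, so switching files would be a small change to the template; the open question is how the extra contigs should be handled afterwards.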
Issue Analytics
- State:
- Created 4 years ago
- Reactions:1
- Comments:5 (3 by maintainers)
Top GitHub Comments
Thanks for pointing this out. Do you think it’s OK to just include the scaffold names as if they were chromosomes or do they need further special treatment?
Also, do you think that their inclusion in the results of genes() should be the default or optional?

I would say it is desirable to have specific logic in the code to cope with release versions of databases, since they can change more than once, which can result in a failure!=failure scenario.
Another reason to use database version specific path pattern hardcoding is that they are not expected to change and thus the hardcoding of path patterns serves a documentation purpose, too.
The last point would be performance: a try-except block might be slightly faster, but loading a file is much slower than a case switch and/or path generation, so I expect the performance difference between the selection methods to be negligible.
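The two selection methods compared above could be sketched roughly like this. Everything here is hypothetical: the function names are made up, the release cutoff is invented purely for illustration and is not taken from Ensembl, and fetch stands in for whatever download routine the library actually uses.

```python
PRIMARY = "{species}.{reference}.{release}.gtf.gz"
FULL = "{species}.{reference}.{release}.chr_patch_hapl_scaff.gtf.gz"

# Hypothetical cutoff, NOT a real Ensembl fact: pretend releases from this
# number onwards ship the chr_patch_hapl_scaff file.
HYPOTHETICAL_FIRST_RELEASE_WITH_FULL = 90


def pattern_by_release(release, include_scaffolds=True):
    """Case-switch style: the release number alone decides the pattern."""
    if include_scaffolds and release >= HYPOTHETICAL_FIRST_RELEASE_WITH_FULL:
        return FULL
    return PRIMARY


def pattern_by_fallback(fetch, candidates=(FULL, PRIMARY)):
    """try-except style: use the first pattern that can be fetched.

    `fetch` is a caller-supplied function that raises FileNotFoundError
    when the file for a pattern does not exist.
    """
    for pattern in candidates:
        try:
            fetch(pattern)
            return pattern
        except FileNotFoundError:
            continue
    raise FileNotFoundError("none of the candidate patterns could be fetched")
```

The first variant documents the expected layout per release in one place; the second adapts on its own, but it can hide the case where a file is missing for an unexpected reason.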