Missing scaffold genes

I use pyensembl for gene ID/name mapping, but lately I noticed that some Ensembl gene IDs (e.g. ENSG00000285395) are missing from pyensembl.EnsemblRelease(97).genes().

I think that’s because pyensembl uses the GTF URL template {species}.{reference}.{release}.gtf.gz instead of {species}.{reference}.{release}.chr_patch_hapl_scaff.gtf.gz (see https://ftp.ensembl.org/pub/release-97/gtf/homo_sapiens/ for comparison).
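To make the comparison concrete, here is a minimal sketch that expands both templates for release 97. The two filename templates are the ones quoted above; the base URL layout is an assumption modelled on the linked FTP directory, and `gtf_url` and `include_scaffolds` are hypothetical names, not part of pyensembl.

```python
# Filename templates quoted in the issue text.
PRIMARY_TEMPLATE = "{species}.{reference}.{release}.gtf.gz"
SCAFFOLD_TEMPLATE = "{species}.{reference}.{release}.chr_patch_hapl_scaff.gtf.gz"

# Assumption: directory layout inferred from the FTP link in the issue.
BASE_TEMPLATE = "https://ftp.ensembl.org/pub/release-{release}/gtf/{species_dir}/"


def gtf_url(species, reference, release, include_scaffolds=False):
    """Build a GTF URL (hypothetical helper, not pyensembl's API)."""
    template = SCAFFOLD_TEMPLATE if include_scaffolds else PRIMARY_TEMPLATE
    filename = template.format(
        species=species.capitalize(),  # "homo_sapiens" -> "Homo_sapiens"
        reference=reference,
        release=release,
    )
    base = BASE_TEMPLATE.format(release=release, species_dir=species.lower())
    return base + filename


primary_url = gtf_url("homo_sapiens", "GRCh38", 97)
scaffold_url = gtf_url("homo_sapiens", "GRCh38", 97, include_scaffolds=True)
print(primary_url)
print(scaffold_url)
```

The two URLs differ only in the `chr_patch_hapl_scaff` infix, so switching files would be a matter of choosing a template.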

I understand that this file includes genes that are not mapped to chromosomes, which might be problematic in the context of pyensembl, but it would still be more complete to include all the genes.

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
iskandr commented, Feb 11, 2020

Thanks for pointing this out. Do you think it’s OK to just include the scaffold names as if they were chromosomes or do they need further special treatment?

Also, do you think that their inclusion in the results of genes() should be the default or optional?
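One way to picture the "default or optional" question is an opt-in flag on the query. This is only a sketch under assumptions: the set of "primary" contig names, the `select_genes` helper, and the example gene and scaffold names are all hypothetical and are not taken from pyensembl or from an Ensembl file.

```python
# Assumption: "primary" contigs are the numbered chromosomes plus X, Y and MT.
PRIMARY_CONTIGS = {str(n) for n in range(1, 23)} | {"X", "Y", "MT"}


def select_genes(records, include_scaffolds=False):
    """Filter (gene_id, contig) pairs; scaffold genes are opt-in.

    Hypothetical helper used only to illustrate the default-vs-optional choice.
    """
    if include_scaffolds:
        # Scaffold names are treated exactly like chromosome names.
        return list(records)
    return [(gene_id, contig) for gene_id, contig in records
            if contig in PRIMARY_CONTIGS]


# Placeholder records, not real annotation data.
example = [
    ("GENE_A", "1"),
    ("GENE_B", "X"),
    ("GENE_C", "HYPOTHETICAL_SCAFFOLD_1"),
]

default_result = select_genes(example)
full_result = select_genes(example, include_scaffolds=True)
```

Keeping the flag off by default would preserve existing results, while flipping the default would surface the scaffold genes to every caller.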

0 reactions
fabianegli commented, Feb 12, 2020

I would say it is desirable to have specific logic in the code to cope with release versions of the databases, since they can change more than once, which can result in a failure!=failure scenario.

Another reason to hardcode database-version-specific path patterns is that they are not expected to change, so the hardcoded patterns also serve a documentation purpose.

The last point is performance: a try-except block might be slightly faster, but loading a file is much slower than a case switch and/or path generation, so I expect the performance difference between selection methods to be negligible.
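A minimal sketch of what release-specific pattern selection could look like, as opposed to trying patterns until one loads. The rule table is a placeholder: the cutoff of 90 is invented for illustration and says nothing about the real Ensembl layout, and `pattern_for_release` is a hypothetical helper.

```python
# Placeholder rules: (first release the pattern applies to, filename pattern).
# The cutoff value is hypothetical and only illustrates the mechanism.
HYPOTHETICAL_RULES = [
    (0, "{species}.{reference}.{release}.gtf.gz"),
    (90, "{species}.{reference}.{release}.chr_patch_hapl_scaff.gtf.gz"),
]


def pattern_for_release(release, rules=HYPOTHETICAL_RULES):
    """Return the pattern of the last rule whose starting release is <= release."""
    chosen = None
    for first_release, pattern in sorted(rules):
        if release >= first_release:
            chosen = pattern
    if chosen is None:
        raise ValueError(f"no path pattern known for release {release}")
    return chosen


old_pattern = pattern_for_release(50)
new_pattern = pattern_for_release(97)
```

Because the table is explicit, it doubles as documentation of which releases use which layout, and an unknown release fails with a clear error instead of a failed download.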
