Genomic coordinate systems

Are genomic positions in sgkit 0-based or 1-based? We should decide and document the coordinate system conventions that sgkit uses.

[Please correct any mistakes I’ve made below!]

When reading VCF, PLINK, and BGEN files we leave the position variables unchanged, so they use the convention of the underlying file format. These are as follows:

In simulate_genotype_call_dataset we create positions starting at 0, which implies they are 0-based. This is inconsistent with the positions we get when reading the file formats above.

For comparison here are the conventions that other libraries and tools use.

Whichever convention we adopt, we will likely need to support multiple ways of specifying intervals for selection, or windowing, which may use either convention. (This article has a good summary of options.)

For example, chr20:1-100 (for selecting, see #25) is 1-based, fully closed (i.e. the end position is inclusive). On the other hand BED (which would be a useful option for specifying windows) uses 0-based, half-open intervals (i.e. the end position is exclusive).
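To make the difference between the two interval conventions concrete, here is a minimal sketch of converting between them. The `Interval` type and both function names are hypothetical and are not part of sgkit; the region-string pattern is an assumption that only covers the simple `contig:start-end` form.

```python
import re
from typing import NamedTuple


class Interval(NamedTuple):
    """A 0-based, half-open interval [start, end), as used by BED."""

    contig: str
    start: int
    end: int


# Assumed pattern for simple region strings such as "chr20:1-100".
_REGION = re.compile(r"^(?P<contig>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def region_to_interval(region: str) -> Interval:
    """Convert a 1-based, fully closed region string to a 0-based, half-open interval."""
    m = _REGION.match(region)
    if m is None:
        raise ValueError(f"Cannot parse region: {region!r}")
    start, end = int(m["start"]), int(m["end"])
    if start < 1 or end < start:
        raise ValueError(f"Invalid 1-based closed region: {region!r}")
    # Only the start shifts by one; the inclusive 1-based end equals
    # the exclusive 0-based end.
    return Interval(m["contig"], start - 1, end)


def interval_to_region(interval: Interval) -> str:
    """Convert a 0-based, half-open interval back to a 1-based, fully closed region string."""
    return f"{interval.contig}:{interval.start + 1}-{interval.end}"
```

Under these assumptions, `chr20:1-100` maps to `Interval("chr20", 0, 100)`, and the number of bases covered is simply `end - start`.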

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 5

Top GitHub Comments

jeromekelleher commented, Jan 5, 2021

Intervals are a different issue I think - we should be opinionated in this case and do half-open like BED.
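One way to see the appeal of half-open intervals for windowing is that adjacent windows tile a contig with no overlaps and no gaps. The sketch below is illustrative only: `fixed_windows` and `positions_in_window` are hypothetical helpers, not sgkit functions, and positions are assumed to be 0-based and sorted.

```python
from bisect import bisect_left
from typing import List, Sequence, Tuple


def fixed_windows(contig_length: int, size: int) -> List[Tuple[int, int]]:
    """Return 0-based, half-open (start, end) windows covering [0, contig_length)."""
    return [
        (start, min(start + size, contig_length))
        for start in range(0, contig_length, size)
    ]


def positions_in_window(
    sorted_positions: Sequence[int], start: int, end: int
) -> Sequence[int]:
    """Select the positions p with start <= p < end from a sorted sequence."""
    lo = bisect_left(sorted_positions, start)
    hi = bisect_left(sorted_positions, end)  # end is exclusive
    return sorted_positions[lo:hi]


windows = fixed_windows(10, 4)
positions = [0, 3, 4, 7, 8, 9]
per_window = [positions_in_window(positions, s, e) for s, e in windows]
```

Because each window's end is the next window's start, every position lands in exactly one window, and the window lengths (`end - start`) sum to the contig length.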
