Genomic coordinate systems
Are genomic positions in sgkit 0-based or 1-based? We should decide and document the coordinate system conventions that sgkit uses.
[Please correct any mistakes I’ve made below!]
When reading VCF, PLINK, and BGEN files we leave the position variables unchanged, so they use the convention of the underlying file format. These are as follows:
- VCF is 1-based, fully closed http://samtools.github.io/hts-specs/VCFv4.3.pdf
- PLINK (.bim) is 1-based https://www.cog-genomics.org/plink/2.0/formats#bim
- BGEN (I can’t find documentation saying which convention this uses)
In simulate_genotype_call_dataset we create positions starting at 0, which implies they are 0-based. So this is inconsistent with reading the file formats above.
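To make the off-by-one concrete, here is a minimal sketch in plain Python. The helper names `one_to_zero_based` and `zero_to_one_based` are hypothetical and are not part of sgkit; the position values are made up for illustration.

```python
def one_to_zero_based(positions):
    """Shift 1-based positions (as stored in VCF POS) to 0-based."""
    return [p - 1 for p in positions]


def zero_to_one_based(positions):
    """Shift 0-based positions to 1-based."""
    return [p + 1 for p in positions]


# Hypothetical positions as they might be read from a VCF file (1-based).
vcf_positions = [1, 5, 10]
# Hypothetical simulated positions that start at 0.
simulated_positions = [0, 1, 2]

# The first base of a contig is labelled 1 in the VCF-derived list
# and 0 in the simulated list, so the two cannot be mixed directly.
print(one_to_zero_based(vcf_positions))        # [0, 4, 9]
print(zero_to_one_based(simulated_positions))  # [1, 2, 3]
```

Whichever convention is chosen, the conversion itself is a constant shift; the hard part is knowing which convention a given array already uses.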
For comparison here are the conventions that other libraries and tools use.
- scikit-allel is 1-based. VCF POS is not changed, and functions are documented as being 1-based (example).
- tskit is 0-based. See discussion here.
- Glow is 0-based, at least for VCF. It converts VCF file positions to 0-based internally (see this section marked ‘Important’).
- Hail is 1-based. See docs, and this warning.
Whichever convention we adopt, we will likely need to support multiple ways of specifying intervals for selection, or windowing, which may use either convention. (This article has a good summary of options.)
For example, chr20:1-100 (for selecting, see #25) is 1-based, fully closed (i.e. the end position is inclusive). On the other hand BED (which would be a useful option for specifying windows) uses 0-based, half-open intervals (i.e. the end position is exclusive).
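As a rough illustration of how the two interval conventions relate, the sketch below converts a 1-based, fully closed region string into a 0-based, half-open (BED-style) triple and back. The function names `region_to_bed` and `bed_to_region` are hypothetical, and the parser assumes a simple `contig:start-end` form with no colons in the contig name.

```python
import re

# Assumed region syntax: contig:start-end, 1-based and fully closed.
_REGION = re.compile(r"^(?P<contig>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def region_to_bed(region):
    """Convert a 1-based closed region string to (contig, start, end), 0-based half-open."""
    m = _REGION.match(region)
    if m is None:
        raise ValueError(f"cannot parse region: {region!r}")
    start, end = int(m["start"]), int(m["end"])
    if start < 1 or end < start:
        raise ValueError(f"invalid 1-based closed interval: {region!r}")
    # Only the start shifts: the inclusive 1-based end equals the exclusive 0-based end.
    return m["contig"], start - 1, end


def bed_to_region(contig, start, end):
    """Convert a 0-based half-open interval back to a 1-based closed region string."""
    return f"{contig}:{start + 1}-{end}"


print(region_to_bed("chr20:1-100"))    # ('chr20', 0, 100)
print(bed_to_region("chr20", 0, 100))  # chr20:1-100
```

Both forms describe the same 100 bases; in the half-open form the length is simply `end - start`.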
Top GitHub Comments
Intervals are a different issue I think - we should be opinionated in this case and do half-open like BED.
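One practical argument for half-open intervals is that adjacent windows tile a contig without overlapping or leaving gaps. A minimal sketch, assuming sorted positions held in a plain Python list and a hypothetical helper `select_half_open` (not an sgkit function):

```python
from bisect import bisect_left


def select_half_open(sorted_positions, start, end):
    """Return positions p with start <= p < end (half-open, like BED)."""
    lo = bisect_left(sorted_positions, start)
    hi = bisect_left(sorted_positions, end)
    return sorted_positions[lo:hi]


# Hypothetical sorted variant positions on one contig.
positions = [0, 4, 9, 10, 99, 100]

# Two adjacent windows: the boundary position 10 falls in exactly one of them.
print(select_half_open(positions, 0, 10))    # [0, 4, 9]
print(select_half_open(positions, 10, 100))  # [10, 99]
```

With fully closed intervals, the same pair of windows would either both include position 10 or need their endpoints adjusted by one.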
Tutorial: Cheat Sheet For One-Based Vs Zero-Based Coordinate Systems
Python CLI: https://github.com/griffithlab/convert_zero_one_based