Genotype encoding scheme
The implementation of the parsimony method over in #125 has raised some troublesome issues.
- We need to be able to represent missing data
- If we want to provide a conventional cost matrix for e.g., Sankoff parsimony, then genotype encoding 0, 1, 2… needs to have a fixed interpretation
Currently, the genotypes output by the variants method are encoded in a site-specific way, and the alleles array is needed to decode the actual string representation. A genotype of 0 at one site might mean “A” and at another it might mean “ATTAAC”. This is fine and works well for producing data because (a) it gives us a lot of flexibility in representing things like short indels and (b) always having the ancestral state encoded as 0, with other alleles numbered as you go down the tree, is a useful property.
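To make the site-specific encoding concrete, here is a minimal sketch. The two sites, their alleles tuples, and the `decode` helper are made up for illustration and are not part of the library; the only point is that the same integer decodes differently depending on the site.

```python
import numpy as np

# Hypothetical data: each site carries its own alleles tuple, with the
# ancestral state at index 0.
sites = [
    {"alleles": ("A", "T"), "genotypes": np.array([0, 1, 0, 1], dtype=np.uint8)},
    {"alleles": ("ATTAAC", "A"), "genotypes": np.array([0, 0, 1, 0], dtype=np.uint8)},
]


def decode(alleles, genotypes):
    """Map site-specific integer genotypes back to allele strings."""
    return [alleles[g] for g in genotypes]


decoded = [decode(s["alleles"], s["genotypes"]) for s in sites]
# Genotype 0 is "A" at the first site but "ATTAAC" at the second.
```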
For inputting the data, I suggest we require the following to be provided:
- An input array of genotypes (np.uint8)
- A list of alleles
- For Sankoff parsimony, a len(alleles) x len(alleles) cost matrix. We could provide a higher-level allele -> allele mapping to take the pain out of this, e.g. cost={"A": {"A": 0, "C": 0.25, ...}, ...}
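One way the higher-level mapping could be expanded into the square matrix is sketched below. The function name `cost_matrix`, the three-allele list, and the cost values are assumptions for illustration only; nothing here is an existing API.

```python
def cost_matrix(alleles, cost):
    """Expand a nested allele -> allele -> cost mapping into a square
    matrix whose rows and columns follow the order of `alleles`."""
    n = len(alleles)
    matrix = [[0.0] * n for _ in range(n)]
    for i, a in enumerate(alleles):
        for j, b in enumerate(alleles):
            # A KeyError here means the mapping does not cover every pair.
            matrix[i][j] = float(cost[a][b])
    return matrix


alleles = ["A", "C", "G"]
# Arbitrary example costs, chosen only to show the indexing.
cost = {
    "A": {"A": 0, "C": 0.25, "G": 0.5},
    "C": {"A": 0.25, "C": 0, "G": 0.5},
    "G": {"A": 0.5, "C": 0.5, "G": 0},
}
matrix = cost_matrix(alleles, cost)
```

With this layout, matrix[i][j] is the cost of changing from alleles[i] to alleles[j], which is only meaningful if genotype i always refers to alleles[i].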
As I see it, there are then two options for specifying missing data: either -1 or len(alleles). Encoding as -1 seems nicer, but genotypes are currently encoded as uint8, so it would really be 255. Using 255 as the missing data value seems quite nasty to me, as we may need to add support for 16 bit genotypes at some point (already available in the C code), and suddenly 255 wouldn’t mean missing data any more. This seems quite ugly, and may lead to tricky bugs. So, if we really do want to use -1 to represent missing data, we should change the type of genotypes to int8, so that we can represent -1 properly. This would mean that we only have space for 128 alleles (values 0 to 127), and may mean needing 16 bit genotypes sooner. This would be a bit of a hassle to implement, and may break some people’s code.
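The width problem can be seen directly in NumPy. This is only a sketch of the arithmetic, using made-up arrays; it does not reflect how the library stores genotypes internally.

```python
import numpy as np

# In unsigned storage, the bit pattern of -1 depends on the width.
as_uint8 = int(np.array([-1], dtype=np.int8).view(np.uint8)[0])     # 255
as_uint16 = int(np.array([-1], dtype=np.int16).view(np.uint16)[0])  # 65535

# A uint8 "missing" value of 255 stays 255 when widened, so it no longer
# matches the 16 bit pattern for -1.
widened_unsigned = int(np.array([255], dtype=np.uint8).astype(np.uint16)[0])

# With signed types, -1 survives widening and the same test works.
genotypes = np.array([0, 1, -1, 2], dtype=np.int8)
widened = genotypes.astype(np.int16)
missing_8 = genotypes == -1
missing_16 = widened == -1

# Non-negative values available for alleles in int8: 0..127.
max_alleles_int8 = int(np.iinfo(np.int8).max) + 1
```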
The other option is to use len(alleles) as the missing data value. The downside here is that you can’t scan the genotypes for missing data, without knowing the alleles.
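A small sketch of that downside, with two hypothetical sites and a hypothetical `missing_mask` helper: the value 2 is the missing marker at one site and a real allele at the other, so the alleles are needed to tell them apart.

```python
import numpy as np

# Hypothetical sites using len(alleles) as the missing-data marker.
site_a = (("A", "T"), np.array([0, 1, 2, 0], dtype=np.uint8))       # 2 is missing
site_b = (("A", "C", "G"), np.array([0, 2, 1, 3], dtype=np.uint8))  # 2 is "G", 3 is missing


def missing_mask(alleles, genotypes):
    """Boolean mask of missing entries; requires the site's alleles."""
    return genotypes == len(alleles)


mask_a = missing_mask(*site_a)
mask_b = missing_mask(*site_b)
```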
Any thoughts?
Right. It might be useful to address https://github.com/tskit-dev/tskit/issues/192 first, which should help enforce consistent behaviour. I’ll work it into a PR.
Seems like we just need a tskit.MISSING_ALLELE; and I think that setting this as -1 and changing to int8 seems fine.