Genotype encoding scheme

The implementation of the parsimony method over in #125 has raised some troublesome issues.

We need to be able to represent missing data
If we want to provide a conventional cost matrix (e.g., for Sankoff parsimony), then the genotype encoding 0, 1, 2, … needs to have a fixed interpretation

Currently, the genotypes output by the variants method are encoded in a site-specific way, and the alleles array is needed to decode the actual string representation. A genotype of 0 at one site might mean “A” and at another it might mean “ATTAAC”. This is fine and works well for producing data because (a) it gives us a lot of flexibility in terms of representing stuff like short indels and (b) always having the ancestral state encoded as 0, with other alleles numbered as you go down the tree, is a useful property.
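To make the site-specific encoding concrete, here is a minimal sketch using plain NumPy. The `sites` list and the `decode` helper are hypothetical stand-ins for illustration, not the library's API; they only mimic the genotypes-plus-alleles pairing described above.

```python
import numpy as np

# Hypothetical per-site data: each site carries its own alleles tuple,
# and the genotype integers index into that tuple.
sites = [
    {"alleles": ("A", "T"), "genotypes": np.array([0, 1, 0, 1], dtype=np.uint8)},
    {"alleles": ("ATTAAC", "A"), "genotypes": np.array([0, 0, 1, 0], dtype=np.uint8)},
]


def decode(alleles, genotypes):
    """Map each integer genotype to its allele string for one site."""
    return [alleles[g] for g in genotypes]


# Genotype 0 decodes to "A" at the first site but to "ATTAAC" at the second.
decoded = [decode(site["alleles"], site["genotypes"]) for site in sites]
```

The same integer therefore carries no meaning on its own, which is why a fixed cost matrix over 0, 1, 2, … cannot be applied without also fixing the alleles list.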

For inputting the data, I suggest we require the following to be provided:

An input array of genotypes (np.uint8)
A list of alleles
For Sankoff parsimony, a len(alleles) x len(alleles) cost matrix (we could provide a higher-level allele -> allele mapping, which would take the pain out of this, i.e., cost={"A": {"A": 0, "C": 0.25, ...}, ...})
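The higher-level mapping suggested in the last item could be turned into the required matrix along these lines. This is only a sketch: `cost_matrix_from_mapping` is a hypothetical helper name, and the cost values are made up for illustration.

```python
import numpy as np


def cost_matrix_from_mapping(alleles, cost):
    """Build a len(alleles) x len(alleles) matrix from a nested allele -> allele dict.

    Row i, column j holds the cost of changing alleles[i] into alleles[j].
    """
    n = len(alleles)
    matrix = np.zeros((n, n), dtype=np.float64)
    for i, a in enumerate(alleles):
        for j, b in enumerate(alleles):
            matrix[i, j] = cost[a][b]
    return matrix


alleles = ["A", "C", "G", "T"]

# Illustrative costs only (an assumption): transitions cost 0.25,
# transversions cost 1.0, and staying the same costs 0.
transitions = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
cost = {
    a: {b: 0.0 if a == b else (0.25 if (a, b) in transitions else 1.0) for b in alleles}
    for a in alleles
}

cost_matrix = cost_matrix_from_mapping(alleles, cost)
```

Because the matrix rows and columns follow the order of `alleles`, genotype value k always refers to `alleles[k]`, which gives the fixed interpretation the cost matrix needs.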

As I see it, there are then two options for specifying missing data: either -1 or len(alleles). Encoding as -1 seems nicer, but genotypes are currently encoded as uint8, so it would really be 255. Using 255 as the missing data value seems quite nasty to me, as we may need to add support for 16 bit genotypes at some point (already available in the C code), and suddenly 255 wouldn’t mean missing data any more. This seems quite ugly, and may lead to tricky bugs. So, if we really do want to use -1 to represent missing data, we should change the type of genotypes to int8, so that we can represent -1 properly. This would mean that we only have space for 128 alleles, and may mean needing 16 bit genotypes sooner. This would be a bit of a hassle to implement, and may break some people’s code.
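The widening problem can be seen directly in NumPy. The sketch below uses made-up genotype arrays; the constant names are illustrative and not part of any existing API.

```python
import numpy as np

# With unsigned storage, "-1" is really the largest representable value.
MISSING_U8 = np.iinfo(np.uint8).max  # 255

# Unsigned case: the third sample is missing, stored as 255.
unsigned = np.array([0, 1, 255, 2], dtype=np.uint8)
widened_unsigned = unsigned.astype(np.uint16)

# After widening, the stored value is still 255, but the "all bits set"
# sentinel for uint16 is 65535, so a scan for it finds nothing.
found_unsigned = widened_unsigned == np.iinfo(np.uint16).max

# Signed case: the third sample is missing, stored as -1.
signed = np.array([0, 1, -1, 2], dtype=np.int8)
widened_signed = signed.astype(np.int16)

# -1 is preserved by widening, so the same comparison keeps working.
found_signed = widened_signed == -1

# The price of the signed type: only 0..127 remain for real alleles.
max_alleles_int8 = np.iinfo(np.int8).max + 1  # 128
```

With a signed type the sentinel survives a move to wider genotypes unchanged, whereas the unsigned sentinel silently turns into an ordinary allele index.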

The other option is to use len(alleles) as the missing data value. The downside here is that you can’t scan the genotypes for missing data without knowing the alleles.
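A short sketch of that downside, again with made-up data and a hypothetical `missing_mask` helper: the same genotype array yields different answers depending on which alleles list accompanies it.

```python
import numpy as np


def missing_mask(genotypes, alleles):
    """Flag missing samples when len(alleles) is used as the missing value."""
    return genotypes == len(alleles)


genotypes = np.array([0, 1, 2, 1], dtype=np.uint8)

# With two alleles, the value 2 is the missing-data marker.
mask_two_alleles = missing_mask(genotypes, ["A", "T"])

# With three alleles, the value 2 is a real allele ("G") and nothing is missing.
mask_three_alleles = missing_mask(genotypes, ["A", "T", "G"])
```

Under this scheme the missing value moves from site to site, so any bulk scan has to carry the per-site allele count along with the genotypes.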

Any thoughts?

Issue Analytics

Created 5 years ago
Comments: 8 (8 by maintainers)

Top GitHub Comments

hyanwong commented, Jun 21, 2019

Right. It might be useful to address https://github.com/tskit-dev/tskit/issues/192 first, which should help enforce consistent behaviour. I’ll work it into a PR.

petrelharp commented, Jun 18, 2019

Seems like we just need a tskit.MISSING_ALLELE; and I think that setting this as -1 and changing to int8 seems fine.