Genotype encoding scheme

The implementation of the parsimony method over in #125 has raised some troublesome issues.

  1. We need to be able to represent missing data.
  2. If we want to provide a conventional cost matrix for, e.g., Sankoff parsimony, then the genotype encoding 0, 1, 2, … needs to have a fixed interpretation.

Currently, the genotypes output by the variants method are encoded in a site-specific way, and the alleles array is needed to decode the actual string representation. A genotype of 0 at one site might mean “A” and at another it might mean “ATTAAC”. This is fine and works well for producing data because (a) it gives us a lot of flexibility in representing things like short indels and (b) always having the ancestral state encoded as 0, with other alleles numbered as you go down the tree, is a useful property.
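To make the site-specific encoding concrete, here is a minimal sketch of decoding genotypes through a per-site alleles list. The `sites` structure and the `decode` helper are hypothetical stand-ins for illustration, not the output format of the variants method.

```python
import numpy as np

# Hypothetical per-site data: each site carries its own alleles list,
# and each genotype value is an index into that list.
sites = [
    {"alleles": ["A", "T"], "genotypes": np.array([0, 1, 0], dtype=np.uint8)},
    {"alleles": ["ATTAAC", "A"], "genotypes": np.array([0, 0, 1], dtype=np.uint8)},
]


def decode(site):
    """Map each integer genotype to its allele string for this site."""
    return [site["alleles"][g] for g in site["genotypes"]]


# The same genotype value 0 decodes to "A" at the first site
# and to "ATTAAC" at the second.
decoded = [decode(site) for site in sites]
```

Without the alleles list for a given site, the integers alone carry no fixed meaning, which is the problem for a shared cost matrix.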

For inputting the data, I suggest we require the following to be provided:

  • An input array of genotypes (np.uint8)
  • A list of alleles
  • For Sankoff parsimony, a len(alleles) x len(alleles) cost matrix (we could provide a higher-level allele -> allele mapping to take the pain out of this, i.e., cost={"A": {"A": 0, "C": 0.25,...}...})
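As a rough sketch of how the higher-level mapping could be turned into the len(alleles) x len(alleles) matrix: the function name `cost_matrix_from_mapping` and the `default` cost for unspecified pairs are assumptions made for this example, not part of any existing API.

```python
import numpy as np


def cost_matrix_from_mapping(alleles, cost, default=1.0):
    """Build a square cost matrix indexed by position in `alleles`.

    `cost` is a nested dict such as {"A": {"C": 0.25}}. Pairs that are
    not listed fall back to `default` (an assumption of this sketch),
    and the diagonal starts at zero.
    """
    n = len(alleles)
    matrix = np.full((n, n), default, dtype=np.float64)
    np.fill_diagonal(matrix, 0.0)
    for i, a in enumerate(alleles):
        for j, b in enumerate(alleles):
            if a in cost and b in cost[a]:
                matrix[i, j] = cost[a][b]
    return matrix


alleles = ["A", "C", "G", "T"]
cost = {"A": {"A": 0, "C": 0.25}, "C": {"A": 0.25}}
matrix = cost_matrix_from_mapping(alleles, cost)
```

Because rows and columns follow the order of the alleles list, the matrix only makes sense if genotype i always refers to alleles[i].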

As I see it, there are then two options for specifying missing data: either -1 or len(alleles). Encoding as -1 seems nicer, but genotypes are currently encoded as uint8, so it would really be 255. Using 255 as the missing data value seems quite nasty to me, as we may need to add support for 16-bit genotypes at some point (already available in the C code), and suddenly 255 wouldn’t mean missing data any more. This seems quite ugly, and may lead to tricky bugs. So, if we really do want to use -1 to represent missing data, we should change the type of genotypes to int8, so that we can represent -1 properly. This would mean that we only have space for 128 alleles, and may mean needing 16-bit genotypes sooner. This would be a bit of a hassle to implement, and may break some people’s code.
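The dtype concern can be checked directly with NumPy. This is only an illustration of the wrap-around behaviour; the `MISSING` name is a hypothetical sentinel defined locally for the example.

```python
import numpy as np

MISSING = -1  # hypothetical sentinel, defined here for illustration only

# Casting -1 to an unsigned type wraps around, so the stored value
# depends on the width of the genotype type.
as_uint8 = np.array([MISSING]).astype(np.uint8)    # stored as 255
as_uint16 = np.array([MISSING]).astype(np.uint16)  # stored as 65535

# With a signed type, -1 is stored as -1 and can be compared directly.
genotypes = np.array([0, 1, MISSING, 2], dtype=np.int8)
mask = genotypes == MISSING

# int8 leaves the values 0..127 for real alleles, i.e. 128 of them.
max_alleles = int(np.iinfo(np.int8).max) + 1
```

Code that tested for 255 under uint8 would silently stop finding missing data after a move to 16-bit genotypes, whereas a comparison against -1 on a signed type keeps working at any width.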

The other option is to use len(alleles) as the missing data value. The downside here is that you can’t scan the genotypes for missing data without knowing the alleles.
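A small sketch of that downside, using a hypothetical `missing_mask` helper: the same genotype array yields a different missing-data mask depending on how many alleles the site has.

```python
import numpy as np


def missing_mask(genotypes, alleles):
    """Flag missing data when the sentinel is len(alleles)."""
    return genotypes == len(alleles)


genotypes = np.array([0, 2, 1], dtype=np.uint8)

# With two alleles, the value 2 is the missing-data sentinel.
mask_two_alleles = missing_mask(genotypes, ["A", "T"])

# With three alleles, the value 2 is a real allele ("G") and nothing is missing.
mask_three_alleles = missing_mask(genotypes, ["A", "T", "G"])
```

The genotype array cannot be interpreted on its own here; every scan for missing data needs the matching alleles list alongside it.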

Any thoughts?

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments:8 (8 by maintainers)

Top GitHub Comments

hyanwong commented, Jun 21, 2019

Right. It might be useful to address https://github.com/tskit-dev/tskit/issues/192 first, which should help enforce consistent behaviour. I’ll work it into a PR.

petrelharp commented, Jun 18, 2019

Seems like we just need a tskit.MISSING_ALLELE, and I think setting this to -1 and changing to int8 is fine.
