Edge case issue with simplify and missing data
I’ve just hit an edge case which is preventing me round-tripping some missing data examples. If we have an extreme edge of the genome in which only a single sample has non-missing data, then this can be represented by a tree at that point with only a single branch, connecting that sample to the root. However, if we run `simplify()` on such a tree sequence, the edge is removed (as it only contains unary nodes). That leaves the sample as an “isolated node”, and hence the missing data code in https://github.com/tskit-dev/tskit/pull/272/ flags it up as a case where the genotype should be set to `-1`, even though in this case, we do have information to properly encode the genotype.
I’m wondering if this is an issue with the missing data code, or the `simplify()` code? For example, in `simplify()` it might be considered reasonable not to drop unary nodes from a sample if they connect that sample to the root? But I’m not sure how the root would be identified in this case.
Ping @jeromekelleher and @petrelharp as they are the simplifying and missing data gurus 😃
Hm: I think that simplify is definitely doing the right thing, as originally defined. That edge isn’t reflecting a genealogical relationship between the samples, which is how we’ve defined things.
I agree, although it was useful to think through.