Edge case issue with simplify and missing data

I’ve just hit an edge case which is preventing me from round-tripping some missing data examples. If we have an extreme edge of the genome in which only a single sample has non-missing data, then this can be represented by a tree at that point with only a single branch, connecting that sample to the root. However, if we run simplify() on such a tree sequence, the edge is removed (as its parent is only a unary node there). That leaves the sample as an “isolated node”, and hence the missing data code in https://github.com/tskit-dev/tskit/pull/272/ flags it up as a case where the genotype should be set to -1, even though in this case we do have the information to properly encode the genotype.

I’m wondering if this is an issue with the missing data code, or the simplify() code? For example, in simplify() it might be considered reasonable not to drop unary nodes above a sample if they connect that sample to the root? But I’m not sure how the root would be identified in this case.

Ping @jeromekelleher and @petrelharp as they are the simplifying and missing data gurus 😃

Issue Analytics

State:
Created 4 years ago
Comments: 17 (17 by maintainers)

Top GitHub Comments

2 reactions

petrelharp commented, Aug 8, 2019

Hm: I think that simplify is definitely doing the right thing, as originally defined. That edge isn’t reflecting a genealogical relationship between the samples, which is how we’ve defined things.

1 reaction

petrelharp commented, Aug 8, 2019

I agree, although it was useful to think through.