Tools for ancestry/inheritance calculation
Here are some closely related use cases that we’re seeing (e.g. #612). I’d like to collect them in one place so we can settle on a minimal number of useful methods and an API. So far:
- (ancestry painting) Given a list of nonoverlapping sets of nodes (the ancestors) and a collection of samples, return somehow an assignment, for each sample, of which set of nodes it inherits from on each part of the genome (if any).
- (ancestry proportions) Instead of the whole ancestry painting, return summaries of it in windows: what proportion of the samples’ genomes inherits from each set of ancestors, in each window.
- (number of descendants) Given a list of samples and a list of nonoverlapping sets of “ancestral” nodes, return for each set of ancestors what proportion of the samples’ genomes inherits from that set of nodes. (This should also be done in windows.)
The last two sound very similar, but one is indexed by ancestors; the other by descendants.
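To make the relationship between the last two concrete, here is a minimal sketch in plain Python. It assumes the ancestry painting is already available as a list of `(sample, left, right, set_index)` segments; that representation, and the function names `windowed_ancestry_proportions` and `mean_over_samples`, are hypothetical and not part of any existing API.

```python
def windowed_ancestry_proportions(painting, samples, num_sets, windows):
    """Summarise an ancestry painting in windows.

    painting: iterable of (sample, left, right, set_index) segments
              (a hypothetical representation, assumed for this sketch).
    windows:  sorted list of window breakpoints.
    Returns props[w][i][k]: the fraction of window w over which
    samples[i] inherits from ancestor set k.
    """
    sample_index = {s: i for i, s in enumerate(samples)}
    num_windows = len(windows) - 1
    props = [[[0.0] * num_sets for _ in samples] for _ in range(num_windows)]
    for sample, left, right, k in painting:
        i = sample_index[sample]
        for w in range(num_windows):
            w_left, w_right = windows[w], windows[w + 1]
            # Length of the intersection of the segment with this window.
            overlap = min(right, w_right) - max(left, w_left)
            if overlap > 0:
                props[w][i][k] += overlap / (w_right - w_left)
    return props


def mean_over_samples(props):
    """Collapse the per-sample axis: per_set[w][k] is the average, over
    samples, of the proportion inherited from ancestor set k in window w."""
    per_set = []
    for window in props:
        num_samples = len(window)
        num_sets = len(window[0])
        per_set.append(
            [sum(row[k] for row in window) / num_samples for k in range(num_sets)]
        )
    return per_set


# Toy painting: sample 0 inherits from set 0 everywhere; sample 1 inherits
# from set 0 on [0, 4) and from set 1 on [4, 10).
painting = [
    (0, 0.0, 10.0, 0),
    (1, 0.0, 4.0, 0),
    (1, 4.0, 10.0, 1),
]
props = windowed_ancestry_proportions(
    painting, samples=[0, 1], num_sets=2, windows=[0.0, 5.0, 10.0]
)
per_set = mean_over_samples(props)
```

In this toy case, `props[0][1]` is `[0.8, 0.2]`: in the first window, sample 1 inherits 80% of its genome from set 0 and 20% from set 1. Averaging over samples gives the summary indexed by ancestor set, so under these assumptions both summaries are reductions of the same window-by-sample-by-set table.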
Issue Analytics
- Created 5 years ago
- Comments: 22 (22 by maintainers)
I reckon we can close it given various bits of progress over the last few years. Just looking at the issues in the OP: (1) has been addressed by `link_ancestors` (though some additional ‘wrapper’ that makes its uses more obvious might be good in the future too). (2) sounds like a node statistic that would be fairly easy to code up if wanted (and if you haven’t done so already, @petrelharp). (3) hasn’t been done to my knowledge, but now that I’m reading over it again, I think some details would need to be ironed out before it was; for example, what would you do if the ancestral node only has extant ancestry in some of the window? I suspect that part of the reason why we dropped this is that there was no immediate use case for it, compared with these other things.
In general, I think we’ve done a lot of work making various tools that are useful for analyses of ancestry and inheritance, and any further work would be better to discuss in a more specific issue.
Something nonobvious I’ve discovered in working in that sandbox. It is sometimes natural to work with boolean arrays. But we should not end up sometimes using boolean arrays and sometimes using arrays of indices, and translating boolean arrays to indices with `np.where()`. It’s hard to keep track of what is what, and I just spent a while tracking down a bug because I applied `np.where()` to an array of indices, which produced no errors at all.
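The silent failure described above is easy to reproduce. The sketch below shows it with NumPy, along with one possible guard; the helper `mask_to_indices` is hypothetical, not an existing function in any library.

```python
import numpy as np

num_nodes = 6
is_ancestor = np.zeros(num_nodes, dtype=bool)
is_ancestor[[2, 5]] = True

# Intended use: convert a boolean mask to the indices of its True entries.
ancestor_ids = np.where(is_ancestor)[0]  # array([2, 5])

# The silent bug: calling np.where on an array that already holds indices.
# Every nonzero index counts as "true", so this returns positions within
# ancestor_ids rather than node IDs, and raises no error.
wrong = np.where(ancestor_ids)[0]  # array([0, 1])


def mask_to_indices(mask):
    """Hypothetical guard: only accept genuinely boolean arrays."""
    mask = np.asarray(mask)
    if mask.dtype != np.bool_:
        raise TypeError(f"expected a boolean mask, got dtype {mask.dtype}")
    return np.flatnonzero(mask)


safe_ids = mask_to_indices(is_ancestor)  # array([2, 5])
```

With a guard like this, passing `ancestor_ids` a second time raises a `TypeError` instead of quietly returning the wrong nodes. Settling on one representation throughout, as suggested above, avoids the need for the check at all.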