Any benefit of multiple sample_sets in allele frequency spectrum
See also #203.
Working through the algorithms for the AFS, it’s not obvious to me that there’s any real perf benefit to providing multiple sample_set arguments. When we’re computing the AFS values, the only thing we’re actually sharing between the different sample sets is the parents array (whereas the clinching argument for allowing other stats to run concurrently is that we share some calculations across different output stats, thereby making it all more efficient).
When we allow for multiple sample sets in the AFS functions, we are hit with choices about how to shape the output array. Firstly, to make the arrays rectangular, we need to make the frequency dimension equal to the size of the largest sample set, which is potentially wasteful if one sample set is much larger than the other. The second choice is the order in which the dimensions should go - it’s not clear to me that either ordering is better.
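To make the shape problem concrete, here is a toy sketch in plain NumPy. It is not the library's implementation: `afs_one_set` and `afs_padded` are hypothetical helpers, the input is assumed to be a simple 0/1 genotype matrix rather than a tree sequence, and the frequency dimension here is the largest sample set size plus one because the toy keeps a bin for a count of zero.

```python
import numpy as np


def afs_one_set(genotypes, sample_set):
    """Toy AFS for one sample set (hypothetical helper).

    genotypes: (num_sites, num_samples) array of 0/1 values.
    sample_set: list of column indices into genotypes.
    Returns the number of sites at each derived allele count.
    """
    counts = genotypes[:, sample_set].sum(axis=1)
    return np.bincount(counts, minlength=len(sample_set) + 1)


def afs_padded(genotypes, sample_sets):
    """Stack per-set spectra into one rectangular array (hypothetical helper)."""
    # Every row is padded with zeros out to the width of the largest set.
    width = max(len(s) for s in sample_sets) + 1
    out = np.zeros((len(sample_sets), width), dtype=np.int64)
    for j, s in enumerate(sample_sets):
        spectrum = afs_one_set(genotypes, s)
        out[j, : len(spectrum)] = spectrum
    return out


# Made-up data: 3 sites, 6 samples, two sample sets of unequal size.
genotypes = np.array(
    [
        [1, 0, 0, 0, 1, 0],
        [1, 1, 0, 1, 1, 1],
        [0, 0, 0, 0, 0, 1],
    ]
)
sample_sets = [[0, 1], [2, 3, 4, 5]]
padded = afs_padded(genotypes, sample_sets)
print(padded.shape)    # (sample set, frequency) ordering
print(padded.T.shape)  # the other candidate ordering
```

The row for the smaller sample set carries trailing zeros that mean nothing, and the amount of padding grows with the size difference between the sets. Neither axis ordering removes that padding.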
Given all this, I wonder if it’s worth the trouble allowing for multiple sample sets when computing the AFS, particularly when you’d almost certainly be better off running several of these in parallel rather than the current vectorised version - the memory access patterns will be pretty nasty, I think.
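The alternative floated above, one independent computation per sample set, might look roughly like the following. Again this is only a sketch on a toy 0/1 genotype matrix: `afs_one_set` and `afs_per_set` are hypothetical names, and whether a thread pool gives a real speedup depends on how much of the work runs outside the interpreter lock, which this example does not measure.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def afs_one_set(genotypes, sample_set):
    """Toy AFS for one sample set (hypothetical helper)."""
    counts = genotypes[:, sample_set].sum(axis=1)
    return np.bincount(counts, minlength=len(sample_set) + 1)


def afs_per_set(genotypes, sample_sets, max_workers=2):
    """Run one independent AFS computation per sample set (hypothetical helper)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of sample_sets in its results.
        return list(pool.map(lambda s: afs_one_set(genotypes, s), sample_sets))


# Made-up data: 3 sites, 6 samples, two sample sets of unequal size.
genotypes = np.array(
    [
        [1, 0, 0, 0, 1, 0],
        [1, 1, 0, 1, 1, 1],
        [0, 0, 0, 0, 0, 1],
    ]
)
results = afs_per_set(genotypes, [[0, 1], [2, 3, 4, 5]])
for spectrum in results:
    print(spectrum.shape)
```

Each result keeps its natural length, so there is no padding and no axis-ordering decision to make; the caller gets a list of arrays rather than a single rectangular one.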
Thoughts @petrelharp?
Issue Analytics
- Created 4 years ago
- Comments:6 (6 by maintainers)
Sounds good - only one sample set is fine. Although it’s maybe worth thinking ahead to how the joint AFS will work.
Closed in #248 and #274.