Redefine summary function to update rather than return
While working through the changes needed to implement the AFS in C over in #248, I found myself copying and pasting a depressing amount of code, most of it doing basically the same thing as the general stats machinery.
It occurred to me that if we changed the way the summary function works slightly, we should be able to reuse all the machinery when we’re computing the jAFS and be able to specify a summary function like the other stats.
Currently, we define f such that f(x) returns a 1D result vector, which we then add on to the current output window, w. That is, we do this in the naive case:
sigma = np.zeros((ts.num_trees, m))
for tree in ts.trees():
    x = np.zeros((ts.num_nodes, k))
    x[ts.samples()] = w
    for u in tree.nodes(order="postorder"):
        for v in tree.children(u):
            x[u] += x[v]
    if polarised:
        s = sum(tree.branch_length(u) * f(x[u]) for u in tree.nodes())
    else:
        s = sum(
            tree.branch_length(u) * (f(x[u]) + f(total - x[u]))
            for u in tree.nodes())
    sigma[tree.index] = s
whereas we’d now do:
sigma = np.zeros((ts.num_trees, m))
for tree in ts.trees():
    x = np.zeros((ts.num_nodes, k))
    x[ts.samples()] = w
    for u in tree.nodes(order="postorder"):
        for v in tree.children(u):
            x[u] += x[v]
        f(tree.branch_length(u), x[u], polarised, sigma[tree.index])
That is:
- We change the summary function to one that directly updates the output window by incrementing it.
- We push the responsibility for dealing with polarisation down into the summary function (I guess, if you wanted to, you could have two different summary functions, polarised and unpolarised, for each stat). This is motivated by the AFS, where the polarised and unpolarised cases are pretty different and I can’t see how we could efficiently do both in a general way. (It’s not clear how this generalises to the site algorithm, though; that is a bit different.)
- We pass in a constant multiplier factor so that the branch length/span is multiplied in as the values are computed.
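To make the proposed change concrete, here is a minimal sketch comparing the two styles on a toy tree. Everything in it is an assumption for illustration: the tree is a plain dictionary rather than a tskit tree, and the names `f_return`, `f_update`, `stat_return` and `stat_update` are hypothetical. The statistic itself (x * (total - x) with a single sample set) is only a stand-in.

```python
# Hypothetical toy tree standing in for a tskit tree: parent -> children.
children = {4: [0, 1], 5: [4, 2], 6: [5, 3]}
branch_length = {0: 1.0, 1: 1.0, 2: 2.0, 3: 3.0, 4: 1.0, 5: 1.0, 6: 0.0}
postorder = [0, 1, 4, 2, 5, 3, 6]
samples = [0, 1, 2, 3]
total = float(len(samples))  # every sample has weight 1 (k = 1)


def subtree_weights():
    """Sum the sample weights below each node with a postorder pass."""
    x = {u: 0.0 for u in postorder}
    for u in samples:
        x[u] = 1.0
    for u in postorder:
        for v in children.get(u, []):
            x[u] += x[v]
    return x


def f_return(x):
    """Current style: return a 1D result vector (m = 1)."""
    return [x * (total - x)]


def f_update(scale, x, polarised, out):
    """Proposed style: increment the output window in place."""
    out[0] += scale * x * (total - x)
    if not polarised:
        y = total - x
        out[0] += scale * y * (total - y)


def stat_return(polarised):
    x = subtree_weights()
    out = [0.0]
    for u in postorder:
        vals = f_return(x[u])
        if not polarised:
            other = f_return(total - x[u])
            vals = [a + b for a, b in zip(vals, other)]
        out = [o + branch_length[u] * v for o, v in zip(out, vals)]
    return out


def stat_update(polarised):
    x = subtree_weights()
    out = [0.0]
    for u in postorder:
        f_update(branch_length[u], x[u], polarised, out)
    return out
```

In this sketch both versions give the same numbers; the difference is that the update style never builds a temporary result vector per node, and the caller no longer needs to know how polarisation is handled.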
The summary function will know about the expected output dimensions (an arbitrary n-D array).
This is a lot less elegant mathematically, but should be much more convenient and efficient computationally. The goal would then be that we do the minimal number of updates to the output window to compute the stats that we want (i.e. touching memory as little as possible). We can also then put in arbitrarily dimensioned arrays as the window elements, which should be general enough for most things!
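As an illustration of an n-D window element, here is a rough sketch of a joint-AFS-style summary function for two sample sets. The toy tree, the `sample_set` mapping and the function names are all hypothetical, and this is not tskit's AFS implementation: the unpolarised branch simply follows the f(x) + f(total - x) definition from the naive case above.

```python
# Hypothetical toy tree standing in for a tskit tree: parent -> children.
children = {4: [0, 1], 5: [4, 2], 6: [5, 3]}
branch_length = {0: 1.0, 1: 1.0, 2: 2.0, 3: 3.0, 4: 1.0, 5: 1.0, 6: 0.0}
postorder = [0, 1, 4, 2, 5, 3, 6]
sample_set = {0: 0, 1: 0, 2: 1, 3: 1}  # sample node -> sample set index
set_sizes = [2, 2]


def subtree_counts():
    """Count the samples from each set below every node."""
    x = {u: [0, 0] for u in postorder}
    for u, s in sample_set.items():
        x[u][s] = 1
    for u in postorder:
        for v in children.get(u, []):
            x[u][0] += x[v][0]
            x[u][1] += x[v][1]
    return x


def afs_update(scale, x, polarised, out):
    """Touch one cell of the 2D window (two if unpolarised)."""
    out[x[0]][x[1]] += scale
    if not polarised:
        out[set_sizes[0] - x[0]][set_sizes[1] - x[1]] += scale


def joint_afs(polarised):
    # The window element is a (n1 + 1) x (n2 + 1) array, not a 1D vector.
    out = [[0.0] * (set_sizes[1] + 1) for _ in range(set_sizes[0] + 1)]
    x = subtree_counts()
    for u in postorder:
        afs_update(branch_length[u], x[u], polarised, out)
    return out
```

A return-style f would have to allocate and add a full (n1 + 1) x (n2 + 1) array for every node, whereas the update style writes to at most two cells per node, which is the memory-touching argument made above.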
It’s a big messy chunk of work to actually do this, so I don’t want to get started unless we agree it’s a good idea. We should definitely land #251 before making a start also.
What do you think @petrelharp, @molpopgen?
I think we’ll have to push this back to 0.2.1. This isn’t going to be a user-visible part of the API much anyway (certainly not before we publish the paper), so I’m OK with potentially changing it after the initial release.
Closing this; I think we’ve changed the underlying code in such a way that this is no longer relevant.