Redefine summary function to update rather than return

While working through the changes needed to implement the AFS in C over in #248, I found myself copying and pasting a depressing amount of code, most of it doing basically the same thing as the general stats machinery.

It occurred to me that if we changed the way the summary function works slightly, we should be able to reuse all the machinery when we’re computing the jAFS and be able to specify a summary function like the other stats.

Currently, we define f such that f(x) returns a 1D result vector, which we then add on to the current output window, w. That is, we do this in the naive case:

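    import numpy as np

    # Assumed in scope: ts (tree sequence), w (num_samples x k weights),
    # f (summary function returning a length-m vector), polarised (bool).
    # total is not defined in the original snippet; assumed here to be the
    # column sums of w.
    total = np.sum(w, axis=0)
    sigma = np.zeros((ts.num_trees, m))
    for tree in ts.trees():
        # x[u] is the total sample weight below node u in this tree.
        x = np.zeros((ts.num_nodes, k))
        x[ts.samples()] = w
        for u in tree.nodes(order="postorder"):
            for v in tree.children(u):
                x[u] += x[v]
        if polarised:
            s = sum(tree.branch_length(u) * f(x[u]) for u in tree.nodes())
        else:
            s = sum(
                tree.branch_length(u) * (f(x[u]) + f(total - x[u]))
                for u in tree.nodes())
        sigma[tree.index] = s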

whereas we’d now do:

    sigma = np.zeros((ts.num_trees, m))
    for tree in ts.trees():
        x = np.zeros((ts.num_nodes, k))
        x[ts.samples()] = w
        for u in tree.nodes(order="postorder"):
            for v in tree.children(u):
                x[u] += x[v]
            # f increments sigma[tree.index] in place instead of returning a value.
            f(tree.branch_length(u), x[u], polarised, sigma[tree.index])

That is:

We change the summary function to one that directly updates the output window by incrementing it
We push the responsibility for dealing with polarisation down into the summary function (I guess, if you wanted to, you could have two different summary functions polarised & unpolarised for each stat). This is motivated by the AFS, where the polarised & unpolarised are pretty different and I can’t see how we could efficiently do both in a general way. (It’s not clear how this generalises to the site algorithm though, this is a bit different.)
We pass in a constant multiplier factor so that the branch length/span is multiplied in as the values are computed.

The summary function will know about the expected output dimensions (an arbitrary n-D array)
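
To make the proposed change concrete, here is a minimal sketch comparing the two calling conventions on a tiny hand-built tree. Everything in it is hypothetical and for illustration only: the toy tree tables, the summary function (a squared weight) and the names `f_return`, `f_update`, `stat_return_style` and `stat_update_style` are not part of tskit. It uses plain Python lists rather than NumPy arrays to stay self-contained.

```python
# Hypothetical toy tree (not the tskit API): samples 0, 1, 2; node 3 is the
# parent of 0 and 1; node 4 is the root.
CHILDREN = {0: [], 1: [], 2: [], 3: [0, 1], 4: [3, 2]}
BRANCH_LENGTH = {0: 1.0, 1: 1.0, 2: 2.0, 3: 1.0, 4: 0.0}
POSTORDER = [0, 1, 3, 2, 4]
SAMPLE_WEIGHTS = {0: [1.0], 1: [1.0], 2: [1.0]}  # k = 1 weight column
TOTAL = [3.0]  # column sums of the sample weights


def node_weights():
    """Sum sample weights up the tree so x[u] is the weight below node u."""
    x = {u: list(SAMPLE_WEIGHTS.get(u, [0.0])) for u in POSTORDER}
    for u in POSTORDER:
        for v in CHILDREN[u]:
            x[u] = [a + b for a, b in zip(x[u], x[v])]
    return x


def f_return(x_u):
    """Return-style summary function: builds a new length-1 result."""
    return [x_u[0] ** 2]


def f_update(scale, x_u, polarised, out):
    """Update-style summary function: increments `out` in place."""
    out[0] += scale * x_u[0] ** 2
    if not polarised:
        # Polarisation is handled inside the summary function.
        out[0] += scale * (TOTAL[0] - x_u[0]) ** 2


def stat_return_style(polarised):
    x = node_weights()
    out = [0.0]
    for u in POSTORDER:
        s = f_return(x[u])[0]
        if not polarised:
            s += f_return([TOTAL[0] - x[u][0]])[0]
        out[0] += BRANCH_LENGTH[u] * s
    return out


def stat_update_style(polarised):
    x = node_weights()
    out = [0.0]
    for u in POSTORDER:
        f_update(BRANCH_LENGTH[u], x[u], polarised, out)
    return out
```

On this toy tree both styles give the same value, so the change is about who owns the output buffer and the polarisation logic rather than about the statistic itself.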

This is a lot less elegant mathematically, but should be much more convenient and efficient computationally. The goal would then be that we do the minimal number of updates to the output window to compute the stats that we want (i.e. touching memory as little as possible). We can also then put in arbitrary dimensioned arrays as the window elements, which should be general enough for most things!
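
As a rough illustration of the "minimal updates" point, an update-style function for a two-population joint AFS could touch only one or two cells of a 2D window per node, instead of building and adding a full array. This is a sketch under assumptions: the toy tree, the sample sets and the name `afs_update` are hypothetical, and the unpolarised branch simply also credits the complementary cell (mirroring the `f(x[u]) + f(total - x[u])` pattern above), which is not necessarily the folding convention the real AFS code uses.

```python
# Hypothetical toy tree (not the tskit API): samples 0, 1, 2; node 3 is the
# parent of 0 and 1; node 4 is the root.
CHILDREN = {0: [], 1: [], 2: [], 3: [0, 1], 4: [3, 2]}
BRANCH_LENGTH = {0: 1.0, 1: 1.0, 2: 2.0, 3: 1.0, 4: 0.0}
POSTORDER = [0, 1, 3, 2, 4]
# Two sample sets: A = {0, 1}, B = {2}. Each sample carries an indicator
# vector, so x[u] ends up as (count in A, count in B) below node u.
SAMPLE_WEIGHTS = {0: (1, 0), 1: (1, 0), 2: (0, 1)}
TOTAL = (2, 1)


def afs_update(scale, x_u, polarised, out):
    """Increment the joint AFS window `out` in place for one node."""
    i, j = x_u
    out[i][j] += scale
    if not polarised:
        # Assumed convention: also credit the complementary count.
        out[TOTAL[0] - i][TOTAL[1] - j] += scale


def joint_afs(polarised):
    # Accumulate sample-set counts up the tree.
    x = {u: SAMPLE_WEIGHTS.get(u, (0, 0)) for u in POSTORDER}
    for u in POSTORDER:
        for v in CHILDREN[u]:
            x[u] = (x[u][0] + x[v][0], x[u][1] + x[v][1])
    # The window element is a 2D array of shape (|A| + 1, |B| + 1).
    out = [[0.0] * (TOTAL[1] + 1) for _ in range(TOTAL[0] + 1)]
    for u in POSTORDER:
        afs_update(BRANCH_LENGTH[u], x[u], polarised, out)
    return out
```

Here the driver loop knows nothing about the shape of `out`; only the summary function does, which is what would let the same machinery serve both 1D stats and n-D windows.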

It’s a big messy chunk of work to actually do this, so I don’t want to get started unless we agree it’s a good idea. We should also definitely land #251 before making a start.

What do you think @petrelharp, @molpopgen?

Issue Analytics

State:
Created 4 years ago
Comments:11 (11 by maintainers)

Top GitHub Comments

jeromekelleher commented, Aug 22, 2019

I think we’ll have to push this back to 0.2.1. This isn’t going to be a user-visible part of the API much anyway (certainly not before we publish the paper), so I’m OK with potentially changing it after the initial release.

jeromekelleher commented, Nov 19, 2019

Closing this; I think we’ve changed the underlying code in such a way that it’s not relevant any more.