Change semantics to ignore edges not ancestral to samples
Working through the details of #248, it’s become clear to me that there’s an inconsistency in how we’re treating edges that are not ancestral to any samples. Consider the following tree sequence:
############################################################
# Nodes #
############################################################
id flags population individual time metadata
0 1 -1 -1 0.00000000000000
1 1 -1 -1 0.00000000000000
2 0 -1 -1 1.00000000000000
3 0 -1 -1 0.00000000000000
4 0 -1 -1 1.00000000000000
############################################################
# Edges #
############################################################
id left right parent child
0 0.00000000 1.00000000 2 0
1 0.00000000 1.00000000 2 1
2 0.00000000 1.00000000 4 3
When we draw this, we get:
1.00┊  2  ┊
    ┊ ┏┻┓ ┊
0.00┊ 0 1 ┊
    0.00 1.00
So, we have two nodes and one edge that are not ancestral to any sample. We don’t draw this edge because it’s not reachable from any root (defined as the last node on an upwards path from a sample). In this case, we compute the total branch length of the tree to be 2. This makes sense: if we threw down mutations on these branches, the number of segregating sites we’d observe in the samples would be proportional to 2.
However, if we changed the topology slightly, so that the non-ancestral edge had node 2 as its parent instead of node 4, we would compute the total branch length to be 3. This is inconsistent: the branch is still not ancestral to any samples, but by changing the topology somewhere above it, we have suddenly started to count it towards our statistics.
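To make the inconsistency concrete, here is a minimal pure-Python sketch (not tskit code) that computes the total branch length two ways for both topologies: over every branch reachable from a root, and over only the branches with at least one sample below them. The function name `branch_lengths` and the `(parent, child)` edge representation are assumptions made for illustration.

```python
# Node times and sample flags copied from the node table above.
TIMES = [0.0, 0.0, 1.0, 0.0, 1.0]
SAMPLES = {0, 1}


def branch_lengths(edges):
    """Return (reachable_from_root, ancestral_to_samples) total branch lengths.

    edges is a list of (parent, child) pairs describing a single tree.
    """
    parent = {c: p for p, c in edges}
    children = {}
    for p, c in edges:
        children.setdefault(p, []).append(c)

    # A root is the last node on an upward path from a sample.
    roots = set()
    for s in SAMPLES:
        u = s
        while u in parent:
            u = parent[u]
        roots.add(u)

    def num_samples(u):
        # Number of samples in the subtree rooted at u (u itself included).
        return (u in SAMPLES) + sum(num_samples(c) for c in children.get(u, []))

    reachable = 0.0
    ancestral = 0.0
    stack = list(roots)
    while stack:
        u = stack.pop()
        for c in children.get(u, []):
            length = TIMES[u] - TIMES[c]
            reachable += length
            if num_samples(c) > 0:
                ancestral += length
            stack.append(c)
    return reachable, ancestral


original = [(2, 0), (2, 1), (4, 3)]  # the edge table above
modified = [(2, 0), (2, 1), (2, 3)]  # non-ancestral edge re-attached below node 2

print(branch_lengths(original))
print(branch_lengths(modified))
```

Under these assumptions the root-reachable total moves from 2 to 3 between the two topologies, while the sample-ancestral total stays at 2 in both.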
At the moment, we are counting all of this ‘silent’ topology towards some branch length statistics, which is, I think, a mistake. This is particularly confusing for the branch AFS, as we get quite different answers when using the current efficient edge algorithm and the naive tree traversal algorithm (both of which are wrong, according to this argument: the edge algorithm counts everything and the tree algorithm counts any edges that are reachable from a root).
To remedy this, I think we should do the following:
- Change the nodes iterator so that, by default, it only traverses down paths that end in a sample. This is easy to do for the default pre-order: we just don’t go down paths with num_samples == 0. I’m not sure it’s easy to do efficiently for other orders, but it can be done inefficiently by simple filtering anyway.
- This will change the definition of total_branch_length to the total branch length ancestral to samples.
- Change branch length stats to only consider branches ancestral to samples. In the branch AFS we’d do this by keeping an extra count for all samples, and only updating the AFS when this is > 0. I’m not sure how it would affect the general definition, but presumably it’s something similar.
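The first two points could look roughly like the sketch below. It is plain Python over dictionaries rather than the real tree API, and the names `nodes_preorder` and `ancestral_only` are hypothetical; it only shows that pruning on num_samples == 0 during a pre-order walk is enough to change what total_branch_length reports.

```python
def nodes_preorder(root, children, num_samples, ancestral_only=True):
    """Yield nodes below root in pre-order.

    When ancestral_only is True, any subtree with no samples is skipped.
    """
    stack = [root]
    while stack:
        u = stack.pop()
        if ancestral_only and num_samples[u] == 0:
            continue  # nothing below u is ancestral to a sample
        yield u
        # Push children reversed so the leftmost child is visited first.
        stack.extend(reversed(children.get(u, [])))


# The modified topology discussed earlier: node 3 hangs below node 2
# but has no samples underneath it.
children = {2: [0, 1, 3]}
parent = {0: 2, 1: 2, 3: 2}
num_samples = {0: 1, 1: 1, 2: 2, 3: 0}
times = {0: 0.0, 1: 0.0, 2: 1.0, 3: 0.0}


def total_branch_length(root, ancestral_only=True):
    # Sum the length of the branch above every visited non-root node.
    return sum(
        times[parent[u]] - times[u]
        for u in nodes_preorder(root, children, num_samples, ancestral_only)
        if u in parent
    )


print(list(nodes_preorder(2, children, num_samples)))
print(total_branch_length(2), total_branch_length(2, ancestral_only=False))
```

For other traversal orders the same check could be applied as a filter on the yielded nodes, at the cost of still visiting the silent subtrees.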
Issue Analytics
- State:
- Created 4 years ago
- Comments:6 (6 by maintainers)
For future weighted stats, yes. For sample set stats, not really, because there we already have to count as zero anything not ancestral to any of the sample sets, so it wouldn’t help. So, this sounds like a good idea, but lacking any immediate uses, I’d propose putting this off until we actually have some more use cases?
OK, let’s leave things as they are from this perspective so.
Ah, that’s good to know. So, how about this for a plan: we build a counter for all samples into the general algorithm, and we don’t call f for any node where this count is zero. This keeps some annoying corner cases out of the f definitions, right? This should be equivalent to what we’ve already agreed on for the AFS.
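A rough sketch of that plan, assuming a naive per-tree version of the general algorithm rather than the efficient incremental one: alongside the weight that is passed to f, each branch carries a count of all samples below it, and f is never called when that count is zero. The function `branch_stat` and its arguments are hypothetical names used only for this illustration.

```python
def branch_stat(edges, times, sample_weights, f):
    """Sum branch_length * f(weight below the branch) over branches
    that are ancestral to at least one sample.

    edges: (parent, child) pairs for a single tree.
    sample_weights: dict mapping each sample node to its weight.
    """
    children = {}
    for p, c in edges:
        children.setdefault(p, []).append(c)

    def below(u):
        # Return (number of samples, total weight) in the subtree rooted at u.
        count = 1 if u in sample_weights else 0
        weight = sample_weights.get(u, 0.0)
        for c in children.get(u, []):
            n, w = below(c)
            count += n
            weight += w
        return count, weight

    total = 0.0
    for p, c in edges:
        count, weight = below(c)
        if count > 0:  # the extra all-samples counter: skip f on silent branches
            total += (times[p] - times[c]) * f(weight)
    return total


times = [0.0, 0.0, 1.0, 0.0, 1.0]
weights = {0: 1.0, 1: 1.0}
original = [(2, 0), (2, 1), (4, 3)]
modified = [(2, 0), (2, 1), (2, 3)]


def always_one(weight):
    # An f that does not vanish at zero weight, so the guard matters.
    return 1.0


print(branch_stat(original, times, weights, always_one))
print(branch_stat(modified, times, weights, always_one))
```

With the guard in place, an f that returns a nonzero value for zero weight still gives the same answer for both topologies, so individual f definitions do not need to handle the silent-branch case themselves.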