Change semantics to ignore edges not ancestral to samples

Working through the details of #248, it’s become clear to me that there’s an inconsistency in how we’re treating edges that are not ancestral to any samples. Consider the following tree sequence:

############################################################
#   Nodes                                                  #
############################################################
id      flags   population      individual      time    metadata
0       1       -1      -1      0.00000000000000
1       1       -1      -1      0.00000000000000
2       0       -1      -1      1.00000000000000
3       0       -1      -1      0.00000000000000
4       0       -1      -1      1.00000000000000
############################################################
#   Edges                                                  #
############################################################
id      left            right           parent  child
0       0.00000000      1.00000000      2       0
1       0.00000000      1.00000000      2       1
2       0.00000000      1.00000000      4       3

When we draw this, we get:

1.00┊  2  ┊
    ┊ ┏┻┓ ┊
0.00┊ 0 1 ┊
  0.00  1.00

So, we have two nodes and one edge that are not ancestral to any sample. We don’t draw this edge because it’s not reachable from any root (defined as the last node on an upwards path from a sample). In this case, we compute the total branch length of the tree to be 2. This makes sense: if we threw down mutations on these branches, the number of segregating sites we’d observe in the samples would be proportional to 2.

However, if we changed the topology slightly, so that the non-ancestral edge had node 2 as its parent (an edge from 2 to 3 rather than from 4 to 3), we would compute the total branch length to be 3. This is inconsistent: the branch is still not ancestral to any samples, but by changing the topology somewhere above it, we have suddenly started to count it towards our statistics.

At the moment, we are counting all of this ‘silent’ topology towards some branch length statistics, which is, I think, a mistake. This is particularly confusing for the branch AFS, as we get quite different answers when using the current efficient edge algorithm and the naive tree traversal algorithm (both of which are wrong, according to this argument: the edge algorithm counts everything and the tree algorithm counts any edges that are reachable from a root).
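The two counts discussed above can be reproduced with a small pure-Python sketch. This is not tskit code: `total_branch_length` here is a hypothetical helper written for illustration, edges are plain `(parent, child)` pairs for a single tree, and the "modified" topology assumes the reading that node 3 is attached below node 2.

```python
# Node times and sample set taken from the tables above.
node_time = {0: 0.0, 1: 0.0, 2: 1.0, 3: 0.0, 4: 1.0}
samples = {0, 1}


def total_branch_length(edges, reachable_only):
    """Sum branch lengths over (parent, child) edges of a single tree.

    With reachable_only=False every edge is counted (the "edge algorithm"
    view). With reachable_only=True only edges reachable from a root are
    counted, where a root is the top of an upward path from a sample.
    """
    if not reachable_only:
        return sum(node_time[p] - node_time[c] for p, c in edges)

    parent = {c: p for p, c in edges}
    children = {}
    for p, c in edges:
        children.setdefault(p, []).append(c)

    # Find the roots by walking upwards from every sample.
    roots = set()
    for u in samples:
        while u in parent:
            u = parent[u]
        roots.add(u)

    # Walk down from each root, counting every edge we pass.
    total = 0.0
    stack = list(roots)
    while stack:
        u = stack.pop()
        for c in children.get(u, []):
            total += node_time[u] - node_time[c]
            stack.append(c)
    return total


original = [(2, 0), (2, 1), (4, 3)]
modified = [(2, 0), (2, 1), (2, 3)]  # assumed reading: node 3 hangs off node 2

print(total_branch_length(original, reachable_only=False))  # 3.0
print(total_branch_length(original, reachable_only=True))   # 2.0
print(total_branch_length(modified, reachable_only=True))   # 3.0
```

In neither topology is the branch above node 3 ancestral to a sample, yet the root-based count moves from 2 to 3.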

To remedy this, I think we should do the following:

Change the nodes iterator so that, by default, it only traverses down paths that end in a sample. This is easy to do for the default pre-order: we just don’t go down paths with num_samples == 0. I’m not sure it’s easy to do efficiently for other orders, but it can be done inefficiently by simple filtering anyway.
This will change the definition of total_branch_length to the total branch length ancestral to samples.
Change branch length stats to only consider branches ancestral to samples. In the branch AFS we’d do this by keeping an extra count for all samples, and only updating the AFS when this is > 0. I’m not sure how it would affect the general definition, but presumably it’s something similar.
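A rough sketch of the first proposal, in plain Python rather than tskit: a pre-order traversal that refuses to descend into subtrees with no samples. The names `count_samples` and `preorder_ancestral` are made up for this example, and the tree is the modified topology in which node 3 is assumed to hang off node 2.

```python
node_time = {0: 0.0, 1: 0.0, 2: 1.0, 3: 0.0}
samples = {0, 1}
# Modified topology: node 3 is a child of node 2 but has no samples below it.
children = {2: [0, 1, 3]}
parent = {0: 2, 1: 2, 3: 2}


def count_samples(root):
    """Return {node: number of samples at or below it} for the tree at root."""
    counts = {}

    def visit(u):
        counts[u] = (1 if u in samples else 0) + sum(
            visit(c) for c in children.get(u, [])
        )
        return counts[u]

    visit(root)
    return counts


def preorder_ancestral(root, num_samples):
    """Pre-order traversal that never descends into a subtree with no samples."""
    stack = [root]
    while stack:
        u = stack.pop()
        yield u
        # Push children in reverse so they are visited left to right.
        for c in reversed(children.get(u, [])):
            if num_samples[c] > 0:
                stack.append(c)


num_samples = count_samples(2)
visited = list(preorder_ancestral(2, num_samples))
# Total branch length ancestral to samples: sum over visited non-root nodes.
total = sum(node_time[parent[u]] - node_time[u] for u in visited if u in parent)
print(visited, total)  # [2, 0, 1] 2.0
```

Under this definition the total stays at 2 regardless of where the silent branch is attached, which is the consistency the proposal is after.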

Top GitHub Comments

petrelharp commented, Jul 25, 2019

> This keeps some annoying corner cases out of the f definitions then, right?

For future weighted stats, yes. For sample set stats, not really, because there we already have to count as zero anything not ancestral to any of the sample sets, so it wouldn’t help. So, this sounds like a good idea, but lacking any immediate uses, I’d propose putting this off until we actually have some more use cases?

jeromekelleher commented, Jul 24, 2019

> I think that in general we should leave in all the bits of the trees; removing them silently is confusing. The AFS is different, because that’s supposed to be something calculable from the given samples. So, I still vote to leave total branch length as it is; if there’s a need for it we could provide a different function… oh wait, we have: this is segregating_sites(ts.samples(), span_normalise=False, mode="branch")

OK, let’s leave things as they are from this perspective so.

> Oh: about the branch length statistics - all the statistics currently implemented except the AFS have the property you say. It’d make future implementation easier if we didn’t have to build this in explicitly (like I did here and here). Note: all the sample set stats have a stronger property: they don’t depend on branches not ancestral to any of the sample_sets.

Ah, that’s good to know. So, how about this for a plan: we build a counter for all samples into the general algorithm, and we don’t call f for any node where this count is zero. This keeps some annoying corner cases out of the f definitions then, right? This should be equivalent to what we’ve already agreed on for the AFS.
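As a naive illustration of that plan (not the efficient edge-based algorithm, and not tskit’s API), the sketch below keeps one count over all samples and one over a sample set, and skips any branch whose all-samples count is zero before f is ever called. The helpers `counts_below`, `branch_stat` and `branch_afs` are hypothetical, and the tree is the modified example with node 3 assumed to hang off node 2.

```python
def counts_below(edges, members):
    """Return {node: number of `members` at or below it} for one tree."""
    parent = {c: p for p, c in edges}
    count = {}
    for u in members:
        # Walk from the member up to its root, bumping every node on the way.
        while True:
            count[u] = count.get(u, 0) + 1
            if u not in parent:
                break
            u = parent[u]
    return count


def branch_stat(edges, node_time, samples, sample_set, f):
    """Sum branch_length * f(x) over branches, where x is the number of
    sample_set members below the branch. Branches with no samples at all
    below them are skipped, so f never sees them."""
    all_count = counts_below(edges, samples)
    set_count = counts_below(edges, sample_set)
    total = 0.0
    for p, c in edges:
        if all_count.get(c, 0) == 0:
            continue  # silent topology: not ancestral to any sample
        total += (node_time[p] - node_time[c]) * f(set_count.get(c, 0))
    return total


def branch_afs(edges, node_time, samples, sample_set):
    """Branch AFS for a single sample set, built on branch_stat."""
    return [
        branch_stat(edges, node_time, samples, sample_set,
                    lambda x, k=k: float(x == k))
        for k in range(len(sample_set) + 1)
    ]


node_time = {0: 0.0, 1: 0.0, 2: 1.0, 3: 0.0}
edges = [(2, 0), (2, 1), (2, 3)]
afs = branch_afs(edges, node_time, samples={0, 1}, sample_set={0})
print(afs)  # [1.0, 1.0]
```

Here the branch above node 1 still lands in the zero bin because it is ancestral to a sample outside the sample set, while the branch above node 3 contributes nothing; without the guard the zero bin would be 2.0.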
