implement mean_descendants as a node statistic

As discussed in #271, this is a node statistic. It differs from the current node stats in that (a) it normalizes by the span of genome over which the node is actually an ancestor; and (b) it is polarised (polarised=False would give something that is always equal to 1). I propose that we make the behavior of (a) a parameter, and implement that normalization at the Python level.

Issue Analytics

Created 4 years ago
Comments: 16 (16 by maintainers)

Top GitHub Comments

hyanwong commented, Apr 17, 2021

Right. But there is a close link between (windowed) GNN (genealogical nearest neighbours) and genomic descent. It would be helpful to give an example like this somewhere, when describing it.

petrelharp commented, Apr 17, 2020

I agree that normalizing by the size of the sample set is always sensible; it’s at least easy for the end user to reverse, so we don’t need an option for it (and certainly not a different function!).

Note: we don’t have a function called compute_ancestry already - when you say “these two exist” do you mean using the definition above in this thread?

All these things are ways of summarizing “what proportion of the genomes of X are inherited from Y” and “over what proportion of the genome is any of X descended from Y”. I vote we should just provide a simple and descriptively-named method for computing these two quantities, in a way that makes it easy to compute these various downstream statistics. For instance, if we define:

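import numpy as np

def compute_ancestry(ts, sample_sets):
   # Proportion of the genomes of each sample set inherited from each node.
   n = np.array([len(x) for x in sample_sets])
   def f(x):
      return x / n
   return ts.sample_count_stat(
      sample_sets, f, len(sample_sets), polarised=True, mode='node', strict=False)

def any_ancestry(ts, sample_sets):
   # Proportion of the genome over which any of the samples descend from each node.
   def f(x):
      # Return a length-1 array to match output_dim=1.
      return np.array([1.0 * (sum(x) > 0)])
   return ts.sample_count_stat(
      sample_sets, f, 1, polarised=True, mode='node', strict=False)

def genomic_descent(ts, sample_sets):
   # These are plain functions here, not TreeSequence methods.
   A = compute_ancestry(ts, sample_sets)  # (num_nodes, num_sample_sets) with windows=None
   D = any_ancestry(ts, sample_sets)      # (num_nodes, 1) with windows=None
   # Broadcasting divides every column of A by D; rows where D == 0 come out as nan.
   return A / D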

So, my proposal would be to remove (or redefine?) mean_descendants, implement the two functions that I have named compute_ancestry and any_ancestry above, and figure out more descriptive names for them. And, probably, make the normalization by any_ancestry an argument to mean_descendants.
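To make the "normalization as an argument" idea concrete, here is a rough NumPy-only sketch of the division step on its own. The function name normalise_descent and its normalise flag are hypothetical, not part of tskit, and returning NaN for nodes with zero ancestry is just one possible convention. The inputs are assumed to look like the outputs of compute_ancestry and any_ancestry above, and the numbers are made up for illustration.

```python
import numpy as np

def normalise_descent(ancestry, any_anc, normalise=True):
    """Optionally divide each column of `ancestry` by `any_anc`.

    ancestry: (num_nodes, num_sample_sets) array (hypothetical input).
    any_anc:  (num_nodes,) or (num_nodes, 1) array (hypothetical input).
    Nodes whose `any_anc` is zero are returned as NaN (an assumed convention).
    """
    ancestry = np.asarray(ancestry, dtype=float)
    if not normalise:
        return ancestry.copy()
    # Reshape to a column so the division broadcasts across sample sets.
    denom = np.asarray(any_anc, dtype=float).reshape(-1, 1)
    out = np.full_like(ancestry, np.nan)
    # Only divide where the node is ancestral somewhere; the rest stays NaN.
    np.divide(ancestry, denom, out=out, where=denom > 0)
    return out

# Made-up values for three nodes and two sample sets (illustration only).
A = np.array([[0.5, 0.25], [0.0, 0.0], [0.2, 0.4]])
D = np.array([0.5, 0.0, 1.0])
G = normalise_descent(A, D)
```

With normalise=False the input comes back unchanged, so a single entry point could serve both the raw and the normalized statistic.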

What do you think, @awohns - is this sufficiently general? Ideas for what to call these things?