Windowing along the genome
Computing statistics in windows along the genome is a basic requirement. For example, users will often want to compute Fst values in (say) 100kb chunks along the genome. I think there are two basic approaches we could take here:
Add a “windows” argument to all functions
Add an argument `windows` to all of the functions in which this would make sense, which would follow a similar logic to how tskit does it. I think this is better than the approach taken in scikit-allel, where we have functions like `windowed_patterson_fst`.
To make things concrete, here’s a rough idea of what Fst would look like:
```python
import numpy as np


def parse_windows(ds, windows):
    # largest_variant_position is a placeholder, assuming we can get this from ds
    L = largest_variant_position(ds) + 1
    if windows is None:
        # By default we have one window
        windows = [0, L]
    elif isinstance(windows, str):
        # Obviously this is stupid and we'd parse out the suffix
        if windows == "100kb":
            step_size = 100_000
        else:
            raise ValueError("Unrecognised window description")
        # arange stops short of L, so append L to close the final window
        windows = np.append(np.arange(0, L, step=step_size), L)
    elif isinstance(windows, int):
        # This is brittle and we'd probably do something better in reality.
        # Interpret the argument as asking for n windows, i.e. n + 1 coordinates
        windows = np.linspace(0, L, num=windows + 1)
    # Otherwise, assume that windows is a 1D array of n values describing
    # n - 1 intervals along the genome.
    return windows


def Fst(ds, *, windows=None):
    windows = parse_windows(ds, windows)
    # compute Fst in variants in these windows and return an array/scalar depending on the input.
    ...
```
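The suffix parsing that the sketch above leaves as a placeholder could look something like the following. This is only an illustration: the function name `parse_window_size` and the set of accepted suffixes are assumptions, not anything proposed in the issue.

```python
import re

# Hypothetical unit table: which suffixes to accept is an assumption.
_UNITS = {"": 1, "bp": 1, "kb": 1_000, "mb": 1_000_000, "gb": 1_000_000_000}


def parse_window_size(spec):
    """Convert a string such as "100kb" into a window size in base pairs."""
    match = re.fullmatch(r"\s*(\d+)\s*([a-zA-Z]*)\s*", spec)
    if match is None or match.group(2).lower() not in _UNITS:
        raise ValueError(f"Unrecognised window description: {spec!r}")
    size = int(match.group(1)) * _UNITS[match.group(2).lower()]
    if size <= 0:
        raise ValueError("Window size must be positive")
    return size


step_size = parse_window_size("100kb")  # 100_000
```

Matching the suffix case-insensitively means "100kb", "100Kb" and "100KB" are treated alike; a bare number is read as base pairs.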
So, in the most general case, `windows` is a 1D array of n coordinates describing n - 1 intervals along the genome. The intervals are half-open (left-inclusive, right-exclusive). We also provide some other handy ways of specifying common types of windows by interpreting different argument types differently. These bells and whistles probably aren’t necessary initially, though.
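To illustrate the half-open convention, here is a small sketch that maps a variant position to the index of the window containing it. The helper name `window_index` is made up for this example, and plain Python is used to keep it self-contained; with NumPy, `np.searchsorted(windows, positions, side="right") - 1` does the same lookup for a whole array at once.

```python
from bisect import bisect_right


def window_index(position, windows):
    """Return i such that windows[i] <= position < windows[i + 1].

    Returns None if the position falls outside every window.
    """
    if position < windows[0] or position >= windows[-1]:
        return None
    return bisect_right(windows, position) - 1


# 4 coordinates describe 3 windows: [0, 100k), [100k, 200k), [200k, 250k)
windows = [0, 100_000, 200_000, 250_000]
positions = [0, 99_999, 100_000, 249_999, 250_000]
indexes = [window_index(p, windows) for p in positions]
print(indexes)  # [0, 0, 1, 2, None]
```

A position equal to a window's left edge belongs to that window, while a position equal to the final coordinate belongs to none.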
Store stats and window afterwards
The previous formulation assumes that what we’re returning is a single value or numpy array, but this is somewhat at odds with current thinking. In #103 we are assuming that we update the input data set with a variable for each thing that we compute on the dataset. Thus, an alternative way we might do things is that we compute the value of `Fst` for each variant and store the `Fst` array in the result dataset. Then, windowing is something that we do afterwards, by aggregating values. E.g.,
```python
ds = sgkit.Fst(ds)  # ds now contains a "stats_Fst" variable, or similar
windowed_ds = sgkit.window_stats(ds, "100kb")
```
I guess this latter approach is more flexible and idiomatic?
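As a rough sketch of what "windowing afterwards" might involve, the function below aggregates per-variant values into half-open windows. Everything here is an assumption for illustration: `aggregate_in_windows` is a hypothetical helper, not the proposed `sgkit.window_stats`, and it works on plain lists rather than a dataset.

```python
from bisect import bisect_left


def aggregate_in_windows(positions, values, windows, agg=sum):
    """Aggregate per-variant values within each half-open window.

    positions must be sorted in ascending order and parallel to values.
    Returns one result per window, with None for windows with no variants.
    """
    results = []
    for left, right in zip(windows[:-1], windows[1:]):
        # Variants with left <= position < right fall in this window
        start = bisect_left(positions, left)
        stop = bisect_left(positions, right)
        in_window = values[start:stop]
        results.append(agg(in_window) if in_window else None)
    return results


positions = [10, 50, 120, 130, 260]
values = [1.0, 2.0, 3.0, 5.0, 7.0]
totals = aggregate_in_windows(positions, values, [0, 100, 200, 300])
print(totals)  # [3.0, 8.0, 7.0]
```

Which aggregation is appropriate depends on the statistic; a plain sum is used here only for illustration.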
Top GitHub Comments
One approach to this that I think could generalize well in the context of Dask is to have a function similar to hail.locus_interval. We have the additional wrinkle of needing to worry about the chunking so I had made something like this:
The chunk interval data is similar, but also needed to be aware of the number of windows within a chunk in order to pre-allocate array results when using a GPU kernel to process a chunk.
Anyways, that could be a good way to orient the discussion for windowing since I think a specific function for it would be ideal rather than trying to bake it into individual methods.
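The per-chunk bookkeeping described in this comment can be pictured with a small sketch. This is not the commenter's code: it assumes that each window is identified by the variant index at which it starts, and both the function name `windows_per_chunk` and its inputs are hypothetical.

```python
from bisect import bisect_left


def windows_per_chunk(window_starts, chunk_sizes):
    """Count how many windows start within each chunk of variants.

    window_starts: sorted variant indexes at which windows begin (assumed layout)
    chunk_sizes: number of variants in each consecutive chunk
    """
    counts = []
    chunk_start = 0
    for size in chunk_sizes:
        chunk_stop = chunk_start + size
        # Windows whose start index lies in [chunk_start, chunk_stop)
        lo = bisect_left(window_starts, chunk_start)
        hi = bisect_left(window_starts, chunk_stop)
        counts.append(hi - lo)
        chunk_start = chunk_stop
    return counts


# Five windows of 4 variants each, spread over two chunks of 10 variants
counts = windows_per_chunk([0, 4, 8, 12, 16], [10, 10])
print(counts)  # [3, 2]
```

Knowing these counts up front is what would allow a result array of the right size to be allocated for each chunk before it is processed.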
+1 to something like `windowed_ds = sgkit.window_stats(ds, "100kb")` that would wrap up the chunking details for sure.
Closing this now that the basic mechanism is working and documented (#303, #404). There is follow-up work in #341.