Windowing along the genome

Computing statistics in windows along the genome is a basic requirement. For example, users will often want to compute Fst values in (say) 100kb chunks along the genome. I think there are two basic approaches we could take here:

Add a “windows” argument to all functions

Add a windows argument to every function where it makes sense, following similar logic to how tskit does it. I think this is better than the approach taken in scikit-allel, which has separate functions like windowed_patterson_fst.

To make things concrete, here’s a rough idea of what Fst would look like:

import numpy as np


def parse_windows(ds, windows):
    L = largest_variant_position(ds) + 1  # assuming we can get this
    if windows is None:
        # By default we have one window
        windows = [0, L]
    elif isinstance(windows, str):
        # Obviously this is stupid and we'd parse out the suffix
        if windows == "100kb":
            step_size = 100_000
        else:
            raise ValueError("Unrecognised window description")
        # np.arange excludes L, so append it to close the last (possibly shorter) window
        windows = np.append(np.arange(0, L, step=step_size), L)
    elif isinstance(windows, int):  # This is brittle and we'd probably do something better in reality
        # Interpret the argument as asking for n windows, which needs n + 1 breakpoints
        windows = np.linspace(0, L, num=windows + 1)
    # Otherwise, assume that windows is a 1D array of n coordinates describing n - 1 intervals along the genome.
    return windows


def Fst(ds, *, windows=None):
    windows = parse_windows(ds, windows)
    # compute Fst in variants in these windows and return an array/scalar depending on the input.

So, in the most general case, windows is a 1D array of n coordinates describing n - 1 intervals along the genome. The intervals are half-open (left-inclusive, right-exclusive). We also provide some other handy ways of specifying common types of windows by interpreting different argument types differently. These bells and whistles probably aren’t necessary initially, though.
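As a minimal sketch of the half-open convention described above, the following assigns each variant position to a window index with np.searchsorted. The function name assign_windows and the -1 marker for positions outside every window are assumptions made for illustration, not part of any proposed API.

```python
import numpy as np


def assign_windows(positions, breakpoints):
    """Return the window index of each position, or -1 if it lies outside all windows.

    breakpoints is a sorted 1D array of n coordinates defining n - 1 half-open
    intervals [breakpoints[i], breakpoints[i + 1]).
    """
    positions = np.asarray(positions)
    breakpoints = np.asarray(breakpoints)
    # side="right" places a position equal to a breakpoint in the window that starts there
    idx = np.searchsorted(breakpoints, positions, side="right") - 1
    outside = (positions < breakpoints[0]) | (positions >= breakpoints[-1])
    return np.where(outside, -1, idx)


positions = np.array([0, 99_999, 100_000, 250_000, 300_000])
breakpoints = np.array([0, 100_000, 200_000, 300_000])
window_index = assign_windows(positions, breakpoints)
```

Position 100_000 falls in the second window because the left edge is inclusive, while 300_000 is excluded because the final right edge is exclusive.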

Store stats and window afterwards

The previous formulation assumes that what we’re returning is a single value or numpy array, but this is somewhat at odds with current thinking. In #103 we are assuming that we update the input data set with a variable for each thing that we compute on the dataset. Thus, an alternative way we might do things is that we compute the value of Fst for each variant and store the Fst array in the result dataset. Then, windowing is something that we do afterwards, by aggregating values. E.g.,

ds = sgkit.Fst(ds)  # ds now contains a "stats_Fst" variable, or similar
windowed_ds = sgkit.window_stats(ds, "100kb")

I guess this latter approach is more flexible and idiomatic?
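To illustrate the "compute per variant, window afterwards" idea, here is a rough sketch that aggregates an array of per-variant values into windows using plain NumPy. The function name window_mean is hypothetical, and the per-window mean is only a stand-in aggregation: a real windowed Fst may need a different reduction than averaging per-variant values.

```python
import numpy as np


def window_mean(positions, values, breakpoints):
    """Mean of per-variant values in each half-open window [b[i], b[i + 1])."""
    positions = np.asarray(positions)
    values = np.asarray(values, dtype=float)
    breakpoints = np.asarray(breakpoints)
    n_windows = len(breakpoints) - 1
    idx = np.searchsorted(breakpoints, positions, side="right") - 1
    # Drop variants that fall before the first or at/after the last breakpoint
    inside = (idx >= 0) & (idx < n_windows)
    sums = np.bincount(idx[inside], weights=values[inside], minlength=n_windows)
    counts = np.bincount(idx[inside], minlength=n_windows)
    # Windows without variants get NaN rather than a division-by-zero warning
    means = np.full(n_windows, np.nan)
    np.divide(sums, counts, out=means, where=counts > 0)
    return means, counts


positions = np.array([10, 50_000, 150_000, 350_000])
per_variant = np.array([0.1, 0.3, 0.2, 0.5])
breakpoints = np.arange(0, 400_001, 100_000)
means, counts = window_mean(positions, per_variant, breakpoints)
```

Returning the per-window counts alongside the means lets a caller tell empty windows apart from windows whose statistic is genuinely undefined.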

Top GitHub Comments

eric-czech commented, Sep 2, 2020

One approach to this that I think could generalize well in the context of Dask is to have a function similar to hail.locus_interval. We have the additional wrinkle of needing to worry about the chunking so I had made something like this:

# `window` = base pairs or numbers of variants
variant_intervals, chunk_intervals = api.axis_intervals(ds, window=100_000, unit='physical')
# `group` = chromosome
# `index` = center of window
# `start` = left side of window
# `stop` = right side of window
variant_intervals.to_dataset('var').to_dataframe().sample(10, random_state=1)

[Screenshot: ten sampled rows of the variant intervals table]

The chunk interval data is similar, but also needed to be aware of the number of windows within a chunk in order to pre-allocate array results when using a GPU kernel to process a chunk.
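As a rough illustration of why the per-chunk window count matters, the sketch below counts how many windows start in each chunk along the variants axis, which is the number a caller would need to pre-allocate a result array for that chunk. The function name windows_per_chunk and the convention of attributing a window to the chunk holding its first variant are assumptions, not the behaviour of the axis_intervals function mentioned above.

```python
import numpy as np


def windows_per_chunk(window_starts, n_variants, chunk_size):
    """Count the windows whose first variant index falls in each chunk."""
    window_starts = np.asarray(window_starts)
    n_chunks = -(-n_variants // chunk_size)  # ceiling division
    # Integer division maps a variant index to the chunk that contains it
    chunk_of_window = window_starts // chunk_size
    return np.bincount(chunk_of_window, minlength=n_chunks)


# Five windows starting at these variant indices, 14 variants, chunks of 5 variants
chunk_counts = windows_per_chunk([0, 3, 6, 9, 12], n_variants=14, chunk_size=5)
```

Windows that span a chunk boundary would still need values from the neighbouring chunk, which this sketch does not handle.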

Anyways, that could be a good way to orient the discussion for windowing since I think a specific function for it would be ideal rather than trying to bake it into individual methods.

+1 to something like windowed_ds = sgkit.window_stats(ds, "100kb") that would wrap up the chunking details for sure.

tomwhite commented, Nov 30, 2020

Closing this now that the basic mechanism is working and documented (#303, #404). There is follow-up work in #341.
