Understand Hail GWAS regression implementation
See original GitHub issue

- sgkit cost estimate on UKB data: https://github.com/related-sciences/ukb-gwas-pipeline-nealelab/issues/32
- hail cost estimate on UKB data: https://github.com/Nealelab/UK_Biobank_GWAS/issues/37
- Work to improve our costs: https://github.com/pystatgen/sgkit/issues/390
We’d like to get our costs closer to Hail’s. To do so, it would be helpful to understand the Hail implementation and see whether there are any ideas in it that we might reuse in ours.
Issue Analytics

- Created: 3 years ago
- Comments: 11 (1 by maintainers)
Top Results From Across the Web

GWAS Tutorial - Hail
This notebook is designed to provide a broad overview of Hail's functionality, with emphasis on the functionality to manipulate and query a genetic...

BroadE: Hail - Practical 2: Genome Wide Association Studies ...
Hail is an open-source library that provides accessible interfaces for exploring genomic data, with a backend that automatically scales to ...

GWAS Tutorial - GitHub Pages
As explained in the Hail tutorial, the data contains a confounder, so it is necessary to include ancestry as a covariate in the...

hail-practical-7-association
On Tuesday, you went through Jeff's GWAS practical using PLINK. In this practical, we're going to learn how to use Hail to perform...

Hail workshop - notebook.community
Using Jupyter notebooks effectively; Loading genetic data into Hail ... Running a Genome-Wide Association Study (GWAS); Rare variant burden tests ...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I tried with chunks: {variant: 64, sample: -1} and it was a bit faster than map_blocks on this dataset. Here’s the notebook and performance report.
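For reference, here is a minimal Dask sketch of that chunking scheme; the array name and sizes are illustrative, not taken from the notebook:

```python
import dask.array as da

# Illustrative dosage matrix with shape (variants, samples); real UKB data is far larger.
dosage = da.random.random((67552, 10000), chunks=(4096, 4096))

# Rechunk so each block holds 64 variants and spans every sample,
# mirroring the chunks: {variant: 64, sample: -1} spec described above.
dosage = dosage.rechunk((64, -1))

print(dosage.chunksize)  # (64, 10000)
```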
I tried another experiment, where I used Dask map_blocks to independently process each block of variants. This is akin to what Hail does (except I’m still doing covariate projection as discussed above). It’s important that the array is not chunked in the samples dimension, which means that the chunk size in the variants dimension has to be quite small. I used 64 to give ~100MB chunks.

On 8x data the processing time on a 16 node cluster was 77s, compared to 110s for the equivalent run in https://github.com/pystatgen/sgkit/issues/390#issuecomment-768332568. This is a 1.4x speedup.
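A rough sketch of this map_blocks approach is below. It assumes the covariates have already been projected out of both the dosages and the trait, so each block reduces to an independent per-variant least-squares fit; the function names and shapes are illustrative, not the actual sgkit implementation.

```python
import dask.array as da
import numpy as np

def _block_betas(G, y):
    # G: (variants_in_block, samples) residualized dosages for one block of variants.
    # y: (samples,) residualized trait, shared by every block.
    # Per-variant least squares on residualized data: beta = (g . y) / (g . g).
    num = G @ y
    denom = (G * G).sum(axis=1)
    return num / denom

def regress_blocks(dosage, trait):
    # dosage: dask array of shape (variants, samples), chunked only along variants
    # (e.g. 64 variants per chunk) so that every block sees all samples.
    # trait: plain NumPy array of shape (samples,), passed unchanged to each block.
    return da.map_blocks(
        _block_betas,
        dosage,
        np.asarray(trait),
        drop_axis=1,          # the samples axis is reduced away within each block
        dtype=np.float64,
    )

# Example with illustrative sizes:
# dosage = da.random.random((67552, 10000), chunks=(64, -1))
# betas = regress_blocks(dosage, np.random.random(10000)).compute()
```

Keeping the samples dimension in a single chunk is what makes the per-block regression valid, since every block sees the complete set of samples.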
Translating this into normalized numbers (using https://github.com/pystatgen/sgkit/issues/390#issuecomment-768380382):

map_blocks: 77 s * 128 cores / 67552 variants / phenotype = 0.15 core seconds/variant/phenotype

This is a ~6x speedup from the original, and if we could use preemptible instances to get a ~5x cost saving, I think that would put us in the same cost ballpark as Hail.
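Restating that arithmetic as a quick check (all numbers come from the comment above):

```python
# 77 s wall time on 128 cores, 67552 variants, 1 phenotype
core_seconds = 77 * 128 / 67552 / 1
print(round(core_seconds, 2))  # 0.15 core seconds/variant/phenotype
```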
Ideally Dask would do this kind of optimization for us so we didn’t have to resort to map_blocks, but it’s good to know that this is a technique we can fall back to if needed.

Here’s the notebook I used, and the performance report.