Understand Hail GWAS regression implementation
See original GitHub issue

- sgkit cost estimate on UKB data: https://github.com/related-sciences/ukb-gwas-pipeline-nealelab/issues/32
- hail cost estimate on UKB data: https://github.com/Nealelab/UK_Biobank_GWAS/issues/37
- Work to improve our costs: https://github.com/pystatgen/sgkit/issues/390
We’d like to get our costs closer to Hail’s. To do so, it would be helpful to understand the Hail implementation and see whether there are any ideas in it that we might reuse in ours.
Issue Analytics

- Created: 3 years ago
- Comments: 11 (1 by maintainers)
Top Results From Across the Web

GWAS Tutorial - Hail
This notebook is designed to provide a broad overview of Hail's functionality, with emphasis on the functionality to manipulate and query a genetic...

BroadE: Hail - Practical 2: Genome Wide Association Studies ...
Hail is an open-source library that provides accessible interfaces for exploring genomic data, with a backend that automatically scales to ...

GWAS Tutorial - GitHub Pages
As explained in the Hail tutorial, the data contains a confounder, so it is necessary to include ancestry as a covariate in the...

hail-practical-7-association
On Tuesday, you went through Jeff's GWAS practical using PLINK. In this practical, we're going to learn how to use Hail to perform...

Hail workshop - notebook.community
Using Jupyter notebooks effectively; Loading genetic data into Hail ... Running a Genome-Wide Association Study (GWAS); Rare variant burden tests ...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I tried with chunks: {variant: 64, sample: -1} and it was a bit faster than map_blocks on this dataset. Here’s the notebook and performance report.
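For reference, here is a minimal Dask sketch of that chunking scheme; the array name and sizes are illustrative, not taken from the notebook:

```python
import dask.array as da

# Illustrative dosage matrix with shape (variants, samples); real UKB data is far larger.
dosage = da.random.random((67552, 10000), chunks=(4096, 4096))

# Rechunk so each block holds 64 variants and spans every sample,
# mirroring the chunks: {variant: 64, sample: -1} spec described above.
dosage = dosage.rechunk((64, -1))

print(dosage.chunksize)  # (64, 10000)
```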
I tried another experiment, where I used Dask map_blocks to independently process each block of variants. This is akin to what Hail does (except I’m still doing covariate projection as discussed above). It’s important that the array is not chunked in the samples dimension, which means that the chunk size in the variants dimension has to be quite small. I used 64 to give ~100MB chunks.

On 8x data the processing time on a 16 node cluster was 77s, compared to 110s for the equivalent run in https://github.com/pystatgen/sgkit/issues/390#issuecomment-768332568. This is a 1.4x speedup.
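A rough sketch of this map_blocks approach is below. It assumes the covariates have already been projected out of both the dosages and the trait, so each block reduces to an independent per-variant least-squares fit; the function names and shapes are illustrative, not the actual sgkit implementation.

```python
import dask.array as da
import numpy as np

def _block_betas(G, y):
    # G: (variants_in_block, samples) residualized dosages for one block of variants.
    # y: (samples,) residualized trait, shared by every block.
    # Per-variant least squares on residualized data: beta = (g . y) / (g . g).
    num = G @ y
    denom = (G * G).sum(axis=1)
    return num / denom

def regress_blocks(dosage, trait):
    # dosage: dask array of shape (variants, samples), chunked only along variants
    # (e.g. 64 variants per chunk) so that every block sees all samples.
    # trait: plain NumPy array of shape (samples,), passed unchanged to each block.
    return da.map_blocks(
        _block_betas,
        dosage,
        np.asarray(trait),
        drop_axis=1,          # the samples axis is reduced away within each block
        dtype=np.float64,
    )

# Example with illustrative sizes:
# dosage = da.random.random((67552, 10000), chunks=(64, -1))
# betas = regress_blocks(dosage, np.random.random(10000)).compute()
```

Keeping the samples dimension in a single chunk is what makes the per-block regression valid, since every block sees the complete set of samples.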
Translating this into normalized numbers (using https://github.com/pystatgen/sgkit/issues/390#issuecomment-768380382):

map_blocks: 77 s * 128 cores / 67552 variants / phenotype = 0.15 core seconds/variant/phenotype

This is a ~6x speedup from the original, and if we could use preemptible instances to get a ~5x cost saving, I think that would put us in the same cost ballpark as Hail.
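Restating that arithmetic as a quick check (all numbers come from the comment above):

```python
# 77 s wall time on 128 cores, 67552 variants, 1 phenotype
core_seconds = 77 * 128 / 67552 / 1
print(round(core_seconds, 2))  # 0.15 core seconds/variant/phenotype
```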
Ideally Dask would do this kind of optimization for us so we didn’t have to resort to map_blocks, but it’s good to know that this is a technique we can fall back to if needed.

Here’s the notebook I used, and the performance report.