
Understand Hail GWAS regression implementation

See original GitHub issue

We’d like to get our costs closer to Hail’s costs. To do so, it would be helpful to understand the Hail implementation and see if there are any ideas in their implementation that we might reuse in ours.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 11 (1 by maintainers)

Top GitHub Comments

1 reaction
tomwhite commented, Feb 17, 2021

I tried with chunks: {variant: 64, sample: -1} and it was a bit faster than map_blocks on this dataset:

  • original code without chunking in the samples dimension: 68 s * 128 cores / 67552 variants / phenotype = 0.13 core seconds/variant/phenotype

Here’s the notebook and performance report.
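
For reference, here is a minimal sketch of the rechunking described in this comment. The dimension names, variable name, and shapes are assumptions for illustration; the real data and regression call are in the linked notebook.

```python
import numpy as np
import xarray as xr

# Toy stand-in for the dosage data, shaped (variants, samples). The real run
# used 67,552 variants; names and shapes here are illustrative only.
n_variants, n_samples = 1_000, 500
ds = xr.Dataset(
    {"call_dosage": (("variants", "samples"), np.random.rand(n_variants, n_samples))}
)

# Rechunk as described above: 64 variants per chunk, and -1 (a single chunk)
# spanning the whole samples dimension, so no block ever splits the samples.
ds = ds.chunk({"variants": 64, "samples": -1})

# The regression (e.g. sgkit.gwas_linear_regression) then runs on ds unchanged;
# only the chunking differs from the original run.
```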

1 reaction
tomwhite commented, Feb 11, 2021

I tried another experiment, where I used Dask map_blocks to independently process each block of variants. This is akin to what Hail does (except I’m still doing covariate projection as discussed above). It’s important that the array is not chunked in the samples dimension, which means that the chunk size in the variants dimension has to be quite small. I used 64 to give ~100MB chunks.
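
A minimal sketch of this map_blocks pattern is below. It is not sgkit's actual implementation: the shapes are toy, there is a single phenotype, and plain per-variant least squares stands in for the covariate-projected regression.

```python
import dask.array as da
import numpy as np

rng = np.random.default_rng(0)
n_variants, n_samples = 1_024, 500  # illustrative; the real run had 67,552 variants

# Dosage chunked only along variants (64 per chunk), never along samples,
# so each block sees every sample and can be regressed independently.
G = da.from_array(rng.random((n_variants, n_samples)), chunks=(64, n_samples))
y = rng.random(n_samples)  # a single phenotype, small enough to ship to every task

def block_betas(g_block, y):
    # Per-variant simple linear regression within one block; a stand-in for
    # the covariate-projected regression the real pipeline performs.
    betas = np.empty(g_block.shape[0])
    for i, g in enumerate(g_block):
        X = np.stack([np.ones_like(g), g], axis=1)  # intercept + dosage
        betas[i] = np.linalg.lstsq(X, y, rcond=None)[0][1]
    return betas[:, None]  # keep 2-D so dask can track block shapes

betas = da.map_blocks(
    block_betas, G, y=y, dtype=np.float64, chunks=(G.chunks[0], (1,))
).compute()
```

Each block covers a contiguous slice of variants and the full set of samples, so no inter-block communication is needed; that independence is what makes the per-block approach cheap.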

On 8x data, the processing time on a 16-node cluster was 77 s, compared to 110 s from the equivalent run in https://github.com/pystatgen/sgkit/issues/390#issuecomment-768332568. This is a 1.4x speedup.

Translating this into normalized numbers (using https://github.com/pystatgen/sgkit/issues/390#issuecomment-768380382):

  • Original implementation: 150 s * 960 cores / 141910 variants / phenotype = 1.01 core seconds/variant/phenotype
  • Improved chunking: 185 s * 192 cores / 141910 variants / phenotype = 0.25 core seconds/variant/phenotype
  • map_blocks: 77 s * 128 cores / 67552 variants / phenotype = 0.15 core seconds/variant/phenotype
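
These figures follow directly from the run times and core counts above; a quick check of the arithmetic:

```python
# core seconds per variant per phenotype = wall-clock seconds * cores / variants
# (each run used a single phenotype)
runs = {
    "original":          (150, 960, 141_910),
    "improved chunking": (185, 192, 141_910),
    "map_blocks":        (77, 128, 67_552),
}
for name, (seconds, cores, variants) in runs.items():
    print(f"{name}: {seconds * cores / variants:.2f} core s/variant/phenotype")
# original: 1.01, improved chunking: 0.25, map_blocks: 0.15
```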

This is a ~6x speedup from the original, and if we could use preemptible instances to get a ~5x cost saving, I think that would put us in the same cost ballpark as Hail.

Ideally Dask would do this kind of optimization for us so we didn’t have to resort to map_blocks, but it’s good to know that this is a technique we can fall back to if needed.

Here’s the notebook I used, and the performance report.
