Time-consuming bottleneck in the HWE benchmark
The HWE benchmark does much the same work as the GWAS tutorial. Because of a bottleneck in the final selection step on the sgkit dataset, I can only run the benchmark on chromosome 21 (201 MB as original VCF) instead of chr1-22 (14 GB as original VCF).
Sgkit is doing:
ds = sg.variant_stats(ds)
ds = sg.hardy_weinberg_test(ds, alleles=2)
ds = ds.sel(variants=((ds.variant_allele_frequency[:,1] > 0.01) & (ds.variant_hwe_p_value > 1e-6)))
Hail is doing:
mt = hl.variant_qc(mt)
mt = mt.filter_rows(mt.variant_qc.AF[1] > 0.01)
mt = mt.filter_rows(mt.variant_qc.p_value_hwe > 1e-6)
PLINK has flags:
--hardy --hwe 1e-6 --maf 0.01
With 16 cores, the three sgkit lines take 0.067585 s, 1.688614 s, and 375.0212 s respectively: sel() accounts for almost all of it, while Hail's filter_rows() takes only 0.010368 s. Sgkit's runtime is dominated by xarray.Dataset.sel(), and is far higher than what Hail and PLINK take. Do you have an idea what an equivalent but more efficient syntax for the same selection would be?
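One workaround that might be worth benchmarking is to materialize the boolean mask as a plain NumPy array and index positionally with isel(), bypassing sel()'s label-handling path. This is a sketch on a synthetic in-memory dataset, not the real sgkit output (the shapes and values are made up, and real sgkit arrays are dask-backed):

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the sgkit dataset (hypothetical sizes/values).
n = 1_000
rng = np.random.default_rng(42)
ds = xr.Dataset(
    {
        "variant_allele_frequency": (("variants", "alleles"), rng.random((n, 2))),
        "variant_hwe_p_value": ("variants", rng.random(n)),
    }
)

# Compute the boolean mask once, pull it out as a plain NumPy array,
# and index positionally with isel() instead of sel().
mask = (
    (ds.variant_allele_frequency[:, 1] > 0.01)
    & (ds.variant_hwe_p_value > 1e-6)
).values
filtered = ds.isel(variants=mask)

assert filtered.sizes["variants"] == int(mask.sum())
```

Whether this actually helps on the dask-backed benchmark dataset would need to be measured; the point of the sketch is only that the same selection can be expressed without going through sel().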
I researched xarray a bit: sel() seems to have been inefficient for years, with an open issue about it (sel() calls isel() internally). variants=((ds.variant_allele_frequency[:,1] > 0.01) & (ds.variant_hwe_p_value > 1e-6)) indexes with boolean arrays, which may not be the form sel() handles best.
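For what it's worth, isel() accepts a positional indexer in either form, so a boolean mask can also be converted to integer positions up front with np.flatnonzero. A minimal sketch on a toy dataset (names are illustrative, not from the benchmark):

```python
import numpy as np
import xarray as xr

# Toy dataset with a single variable along the "variants" dimension.
ds = xr.Dataset({"p": ("variants", np.linspace(0.0, 1.0, 11))})
mask = (ds.p > 0.5).values  # plain NumPy boolean mask

# isel() accepts both a boolean mask and the equivalent integer
# positions; np.flatnonzero converts one to the other.
by_bool = ds.isel(variants=mask)
by_int = ds.isel(variants=np.flatnonzero(mask))

assert (by_bool.p.values == by_int.p.values).all()
```

Whether the integer-position form is faster than the boolean form on a large dask-backed dataset is exactly the kind of thing this benchmark could measure.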
Issue Analytics
- State:
- Created 2 years ago
- Comments: 8
Top GitHub Comments
Good news: this method has reduced the sel() time from 375 to 20 seconds, measured on chr21.

Yes, forcing everything to be eager seems like a bad measure of how things would work in practice, since that is something we'd discourage.