Bottleneck in terms of time consuming of HWE benchmark


The HWE benchmark does something similar to the GWAS tutorial. Because of a bottleneck in the final selection on the sgkit dataset, I can only run the benchmark on chromosome 21 (201 MB in the original VCF) instead of chr1-22 (14 GB in the original VCF).

Sgkit is doing:

ds = sg.variant_stats(ds)
ds = sg.hardy_weinberg_test(ds, alleles=2)
ds = ds.sel(variants=((ds.variant_allele_frequency[:,1] > 0.01) & (ds.variant_hwe_p_value > 1e-6)))

Hail is doing:

mt = hl.variant_qc(mt)
mt = mt.filter_rows(mt.variant_qc.AF[1] > 0.01)
mt = mt.filter_rows(mt.variant_qc.p_value_hwe > 1e-6)

PLINK has flags: --hardy --hwe 1e-6 --maf 0.01

The three sgkit lines take 0.067585, 1.688614, and 375.0212 seconds respectively on 16 cores. sel() accounts for almost all of that time, whereas Hail’s filter_rows() takes only 0.010368 seconds.
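For anyone reproducing per-line timings like these, here is a minimal sketch of a timing helper using only the standard library. The names `timed` and `timings` are hypothetical and not part of sgkit, Hail, or the original benchmark. Note that if the dataset is backed by lazily evaluated arrays (for example dask), a per-line timing may attribute deferred work to whichever line first triggers computation.

```python
import time
from contextlib import contextmanager

# Hypothetical helper: collects wall-clock time per labelled step.
timings = {}

@contextmanager
def timed(label):
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record elapsed seconds even if the step raises.
        timings[label] = time.perf_counter() - start

# Stand-in workload; in the benchmark each sgkit line would go in its own block.
with timed("sum"):
    total = sum(range(100_000))

for label, seconds in timings.items():
    print(f"{label}: {seconds:.6f} s")
```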

Sgkit’s runtime is dominated by the xarray.Dataset.sel() call, which takes far longer than the equivalent steps in Hail and PLINK. Do you have an idea of an equivalent but more efficient syntax for the same selection?

I researched xarray’s sel() a bit. It seems to have been inefficient for years, with an open issue about it (sel() calls isel() internally). The expression variants=((ds.variant_allele_frequency[:,1] > 0.01) & (ds.variant_hwe_p_value > 1e-6)) indexes with a boolean array, which might not be the usage sel() is intended for.
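One variant that may be worth trying, offered as a sketch and not as a confirmed fix, is to materialize the boolean mask once and pass integer positions to isel(). The toy dataset below is a hypothetical stand-in for an sgkit dataset: only the variable names variant_allele_frequency and variant_hwe_p_value and the variants dimension come from the question, and everything else (sizes, random values, the `keep` and `filtered` names) is assumed for illustration. No timing claim is made here; whether this helps on real data would need to be measured.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
n_variants = 1_000

# Toy data (assumption): biallelic frequencies and HWE p-values per variant.
alt_freq = rng.uniform(0.0, 0.5, n_variants)
allele_freq = np.stack([1.0 - alt_freq, alt_freq], axis=1)
p_values = rng.uniform(0.0, 1.0, n_variants)
p_values[:10] = 0.0  # force a few variants to fail the HWE filter

ds = xr.Dataset(
    {
        "variant_allele_frequency": (("variants", "alleles"), allele_freq),
        "variant_hwe_p_value": (("variants",), p_values),
    }
)

# Same condition as in the question, built once as a boolean DataArray.
mask = (ds.variant_allele_frequency[:, 1] > 0.01) & (ds.variant_hwe_p_value > 1e-6)

# .values yields a NumPy array (forcing computation if the data were lazy);
# flatnonzero converts the boolean mask to integer positions for isel().
keep = np.flatnonzero(mask.values)
filtered = ds.isel(variants=keep)

print(filtered.sizes["variants"], "of", n_variants, "variants kept")
```

Because variants is a dimension here and not a labelled coordinate, positional selection with isel() expresses the intent directly.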

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 8

Top GitHub Comments

1 reaction
LiangdeLI commented, Oct 18, 2021

You should try this instead:

The good news is that this method reduced the sel() time from 375 to 20 seconds, measured on chr21.

0 reactions
jeromekelleher commented, Oct 21, 2021

Yes, forcing everything to be eager seems like a bad measure of how things would work in practise, since this is something we’d discourage.
