Runtime bottleneck in the HWE benchmark
The HWE benchmark does much the same thing as the GWAS tutorial. Because of a bottleneck in the final selection step on the sgkit dataset, I can only run the benchmark on chromosome 21 (201 MB in the original VCF) instead of chr1-22 (14 GB in the original VCF).
Sgkit is doing:
ds = sg.variant_stats(ds)
ds = sg.hardy_weinberg_test(ds, alleles=2)
ds = ds.sel(variants=((ds.variant_allele_frequency[:,1] > 0.01) & (ds.variant_hwe_p_value > 1e-6)))
Hail is doing:
mt = hl.variant_qc(mt)
mt = mt.filter_rows(mt.variant_qc.AF[1] > 0.01)
mt = mt.filter_rows(mt.variant_qc.p_value_hwe > 1e-6)
PLINK has flags:
--hardy --hwe 1e-6 --maf 0.01
The three sgkit lines take 0.067585, 1.688614, and 375.0212 seconds respectively when using 16 cores. sel() takes most of that time, whereas Hail's filter_rows() takes only 0.010368 seconds.
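Per-step timings like these can be collected with a small wall-clock helper. The sketch below uses only the standard library; the `timed` helper and the placeholder workloads are hypothetical and are not taken from the benchmark code. Keep in mind that when a step is lazy (for example with dask-backed arrays), the measured time covers only the work actually triggered at that line.

```python
import time
from contextlib import contextmanager

# Collected durations in seconds, keyed by step label.
timings = {}


@contextmanager
def timed(label):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


# Placeholder workloads standing in for the three benchmark steps
# (hypothetical; the real steps would be the sgkit calls).
with timed("stats"):
    total = sum(range(100_000))
with timed("test"):
    squares = [i * i for i in range(100_000)]
with timed("select"):
    kept = [s for s in squares if s % 2 == 0]

for label, seconds in timings.items():
    print(f"{label}: {seconds:.6f} s")
```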
sgkit's runtime is dominated by the xarray.Dataset.sel() call, which takes far longer than the equivalent steps in Hail and PLINK. Do you have any idea what an equivalent but more efficient syntax for the same selection would be?
I did a bit of research on xarray's sel(); it seems to have been inefficient for years, with an open issue about it (sel() calls isel() internally). variants=((ds.variant_allele_frequency[:,1] > 0.01) & (ds.variant_hwe_p_value > 1e-6)) indexes with a boolean array, which might not be the preferred way to use sel().
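One option sometimes suggested for this pattern is to evaluate the boolean mask once, convert it to integer positions, and pass those to isel() directly. The sketch below uses a small synthetic in-memory dataset: the sizes, the random values, and the `example_call_data` variable are made up, and the other variable names merely mirror the sgkit snippet above. Nothing here has been timed against the chr21 data, so treat it as an illustration of the syntax rather than a measured fix.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the sgkit dataset (sizes and values are made up).
rng = np.random.default_rng(0)
n_variants, n_samples = 1_000, 8
alt_freq = rng.uniform(0.0, 0.5, size=n_variants)
ds = xr.Dataset(
    {
        "variant_allele_frequency": (
            ("variants", "alleles"),
            np.column_stack([1.0 - alt_freq, alt_freq]),
        ),
        "variant_hwe_p_value": (
            ("variants",),
            rng.uniform(0.0, 1.0, size=n_variants),
        ),
        # Hypothetical per-sample variable, included only to show that
        # every variable on the "variants" dimension is filtered together.
        "example_call_data": (
            ("variants", "samples"),
            rng.integers(0, 3, size=(n_variants, n_samples)),
        ),
    }
)

# Evaluate the filter once as a plain NumPy boolean array. For a dask-backed
# dataset, .values would trigger computation of just these two variables.
mask = (
    (ds.variant_allele_frequency[:, 1] > 0.01)
    & (ds.variant_hwe_p_value > 1e-6)
).values

# Convert the mask to integer positions and select positionally with isel().
keep = np.flatnonzero(mask)
filtered = ds.isel(variants=keep)

print(filtered.sizes["variants"], "of", ds.sizes["variants"], "variants kept")
```

Whether this helps on real data depends on how the dataset is chunked and on how much of the cost comes from computing the mask itself, so it is worth timing both forms on the same input.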
The good news is that this method reduced the sel() time from 375 to 20 seconds, measured on chr21.

Yes, forcing everything to be eager seems like a bad measure of how things would work in practice, since this is something we'd discourage.