Genetics data IO performance stats/doc

This is a dump of some of the performance experiments. It’s part of a larger issue of performance setup and best practices for dask/sgkit and genetic data. The goal is to share the findings and continue the discussion.

Where not otherwise stated, the test machine is a GCE VM with 16 cores, 64GB of memory, and 400 SPD. The Dask cluster is single-node and process-based. If the data is read from GCS, the bucket is in the same region as the VM.

The issue with suboptimal saturation was originally reported for this code:

import fsspec
import xarray as xr
from import unpack_variables
from dask.diagnostics import ProgressBar, ResourceProfiler, Profiler

path = "gs://foobar/data.zarr"
store = fsspec.mapping.get_mapper(path, check=False, create=False)
ds = xr.open_zarr(store, concat_characters=False, consolidated=False)
ds = unpack_variables(ds, dtype='float16')

ds["variant_dosage_std"] = ds["call_dosage"].astype("float32").std(dim="samples")
with ProgressBar(), Profiler() as prof, ResourceProfiler() as rprof:
    ds['variant_dosage_std'] = ds['variant_dosage_std'].compute()

With local input, performance graph:

bokeh_plot (1)

It’s pretty clear the cores are well saturated. I also measured the GIL: it was held for 13% of the time and waited on for 2.1%, with each of the 16 worker threads holding it for 0.7% and waiting for 0.1% of the time.

For GCS input (via fsspec):

bokeh_plot (2)

GIL summary: the GIL was held for 18.6% of the time and waited on for 3.8%. Each of the 16 worker threads held it for 0.6% and waited for 0.2% of the time, except for one thread that held the GIL for 6.5% and waited for 1.6% of the time.

held: 0.186 (0.191, 0.187, 0.186)
wait: 0.038 (0.046, 0.041, 0.039)
    held: 0.015 (0.029, 0.017, 0.015)
    wait: 0.002 (0.002, 0.002, 0.002)
    held: 0.065 (0.061, 0.064, 0.065)
    wait: 0.016 (0.015, 0.017, 0.016)
    held: 0.0 (0.0, 0.0, 0.0)
    wait: 0.0 (0.0, 0.0, 0.0)
    held: 0.006 (0.006, 0.006, 0.006)
    wait: 0.002 (0.002, 0.002, 0.002)
    held: 0.006 (0.006, 0.006, 0.006)
    wait: 0.002 (0.002, 0.002, 0.001)
    held: 0.006 (0.008, 0.007, 0.007)
    wait: 0.002 (0.001, 0.001, 0.002)
    held: 0.006 (0.006, 0.006, 0.006)
    wait: 0.002 (0.001, 0.001, 0.002)
    held: 0.006 (0.006, 0.006, 0.006)
    wait: 0.001 (0.001, 0.001, 0.001)
    held: 0.006 (0.006, 0.006, 0.006)
    wait: 0.001 (0.001, 0.001, 0.001)
    held: 0.006 (0.006, 0.006, 0.006)
    wait: 0.002 (0.002, 0.002, 0.002)
    held: 0.006 (0.006, 0.006, 0.006)
    wait: 0.002 (0.001, 0.002, 0.002)
    held: 0.006 (0.006, 0.006, 0.006)
    wait: 0.002 (0.001, 0.002, 0.002)
    held: 0.006 (0.007, 0.007, 0.007)
    wait: 0.002 (0.001, 0.002, 0.002)
    held: 0.006 (0.006, 0.006, 0.006)
    wait: 0.002 (0.002, 0.002, 0.002)
    held: 0.006 (0.006, 0.006, 0.006)
    wait: 0.002 (0.002, 0.002, 0.002)
    held: 0.006 (0.006, 0.007, 0.006)
    wait: 0.002 (0.002, 0.002, 0.002)
    held: 0.006 (0.006, 0.006, 0.006)
    wait: 0.001 (0.001, 0.001, 0.001)
    held: 0.006 (0.006, 0.006, 0.006)
    wait: 0.001 (0.002, 0.002, 0.001)
    held: 0.006 (0.006, 0.006, 0.006)
    wait: 0.002 (0.001, 0.001, 0.001)
    held: 0.0 (0.0, 0.0, 0.0)
    wait: 0.0 (0.0, 0.0, 0.0)
    held: 0.001 (0.0, 0.001, 0.001)
    wait: 0.001 (0.001, 0.001, 0.001)
    held: 0.002 (0.002, 0.002, 0.002)
    wait: 0.001 (0.001, 0.001, 0.001)
    held: 0.001 (0.001, 0.001, 0.001)
    wait: 0.003 (0.012, 0.004, 0.003)

It’s clear that CPU usage is lower and not fully saturated, and GIL wait time is slightly up (with a concerning spike in one thread). With remote/fsspec input, we have the overhead of data decryption and potential network IO overhead (though it doesn’t seem like we hit network limits).
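To make the per-thread spike easier to spot in a dump like the one above, here is a minimal sketch that parses the `held:`/`wait:` lines and picks out the thread that held the GIL the most. The helper names (`parse_gil_dump`, `busiest_thread`) are hypothetical, and the sketch assumes the layout shown above: unindented lines are process totals, indented lines are per-thread pairs, and only the first number on each line is used (the parenthesised values are ignored).

```python
import re

# Matches lines such as "    held: 0.065 (0.061, 0.064, 0.065)".
LINE_RE = re.compile(r"^(?P<indent>\s*)(?P<kind>held|wait):\s*(?P<value>[0-9.]+)")


def parse_gil_dump(text):
    """Return (total, threads) parsed from a held/wait dump.

    total is a dict like {"held": x, "wait": y} for the whole process;
    threads is a list of such dicts, one per indented held/wait pair.
    """
    total = {}
    threads = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip anything that is not a held/wait line
        kind = m.group("kind")
        value = float(m.group("value"))
        if not m.group("indent"):
            total[kind] = value  # unindented: process-wide figure
        elif kind == "held":
            threads.append({"held": value})  # starts a new per-thread pair
        else:
            threads[-1]["wait"] = value  # completes the current pair
    return total, threads


def busiest_thread(threads):
    """Index of the thread with the largest 'held' fraction."""
    return max(range(len(threads)), key=lambda i: threads[i]["held"])


# A shortened copy of the dump above (first three threads only).
SAMPLE = """\
held: 0.186 (0.191, 0.187, 0.186)
wait: 0.038 (0.046, 0.041, 0.039)
    held: 0.015 (0.029, 0.017, 0.015)
    wait: 0.002 (0.002, 0.002, 0.002)
    held: 0.065 (0.061, 0.064, 0.065)
    wait: 0.016 (0.015, 0.017, 0.016)
    held: 0.006 (0.006, 0.006, 0.006)
    wait: 0.002 (0.002, 0.002, 0.002)
"""

total, threads = parse_gil_dump(SAMPLE)
idx = busiest_thread(threads)
print(f"process: held {total['held']:.1%}, wait {total['wait']:.1%}")
print(f"busiest thread #{idx}: held {threads[idx]['held']:.1%}, "
      f"wait {threads[idx]['wait']:.1%}")
```

Running the same parser over the local-input and GCS-input dumps would give a quick side-by-side of which thread, if any, stands out.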

jeromekelleher commented, Jan 13, 2021

A lot to digest here, thanks for the great work @ravwojdyla!

ravwojdyla commented, Jan 19, 2021

Based on the performance tests done above, here are some high level guidelines for dask performance experiments (this is a starting point, we might find a better home for this later, and potentially have someone from Dask review them):

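  • disable spilling during performance tuning. For example, when you search for the right chunking scheme: spilling involves serde and IO, both of which are expensive, so it’s better for a job that would spill to fail immediately, since the chunking and/or cluster spec is likely suboptimal:
import dask as dk
from dask.distributed import Client

def get_dask_cluster(n_workers=1, threads_per_worker=None):
    # Do not let the nanny terminate a worker that exceeds its memory limit.
    dk.config.set({"distributed.worker.memory.terminate": False})
    # Never spill to disk; pause the worker at 90% of its memory limit.
    workers_kwargs = {"memory_target_fraction": False,
                      "memory_spill_fraction": False,
                      "memory_pause_fraction": .9}
    return Client(n_workers=n_workers,
                  threads_per_worker=threads_per_worker, **workers_kwargs)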
  • use a distributed cluster. If there is a single worker and you perform non-numpy operations, measure the GIL (see gil_load); if you want to measure cluster communication overhead, you need more workers
  • capture a performance report (see the Dask distributed diagnostics documentation)
  • if you need high-granularity VM stats, use atop to record them, and remember to install netatop to capture network stats
  • if you read data from GCS, make sure your VM is in the same region as the data bucket
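
As an illustration of the performance report guideline, here is a minimal sketch using `performance_report` from `dask.distributed`. The helper name `run_with_report`, the report filename, and the random array standing in for the dosage data are all assumptions made for the example. It uses an in-process cluster (`processes=False`) only to keep the sketch light, whereas the experiments above use a process-based cluster. Rendering the report requires bokeh to be installed.

```python
import dask.array as da
from dask.distributed import Client, performance_report


def run_with_report(report_path="dask-report.html"):
    """Run a small std() reduction and save a Dask performance report.

    Hypothetical helper: the array is a random stand-in for real data.
    """
    # In-process workers keep the sketch light; use process-based workers
    # (the default) for real experiments.
    with Client(processes=False, n_workers=1, threads_per_worker=2):
        x = da.random.random((2000, 2000), chunks=(500, 500))
        # Everything computed inside this block is captured in the report.
        with performance_report(filename=report_path):
            result = x.std(axis=1).compute()
    return result.shape


if __name__ == "__main__":
    print(run_with_report())
```

The resulting HTML file can be attached to an issue, which makes it easier to compare runs across chunking schemes or cluster specs.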


  • add VM specs
