
Means of zarr arrays cause a memory overload in dask workers


What is your issue?

Hello everyone!

I am submitting this issue here, but it is not entirely clear whether my problem comes from xarray, dask, or zarr.

The goal here is to compute a mean from the GCM anomalies of SSH. The following simple code creates an artificial dataset with the anomaly fields (each variable is about 90 GB) and computes the cross-product means.

import dask.array as da
import numpy as np
import xarray as xr

# Artificial dataset: two large anomaly variables, chunked along the time dimension
ds = xr.Dataset(
    dict(
        anom_u=(["time", "face", "j", "i"], da.ones((10311, 1, 987, 1920), chunks=(10, 1, 987, 1920))),
        anom_v=(["time", "face", "j", "i"], da.ones((10311, 1, 987, 1920), chunks=(10, 1, 987, 1920))),
    )
)

ds["anom_uu_mean"] = (["face", "j", "i"], np.mean(ds.anom_u.data**2, axis=0))
ds["anom_vv_mean"] = (["face", "j", "i"], np.mean(ds.anom_v.data**2, axis=0))
ds["anom_uv_mean"] = (["face", "j", "i"], np.mean(ds.anom_u.data * ds.anom_v.data, axis=0))

ds[["anom_uu_mean", "anom_vv_mean", "anom_uv_mean"]].compute()

I was expecting low memory usage because, once a single chunk of anom_u and anom_v has been used in a mean iteration, those two chunks can be forgotten. The following figure confirms that memory usage is indeed very low, so all is well.

[Figure: worker memory usage stays very low during the computation on the in-memory dataset]
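Not part of the original issue: a minimal sketch of one way to record worker memory during such a computation, assuming a dask.distributed cluster is in use (the issue does not say how the cluster was created); MemorySampler ships with distributed.diagnostics.

from distributed import Client
from distributed.diagnostics.memory_sampler import MemorySampler

client = Client()  # hypothetical local cluster; replace with the real scheduler address

# Sample cluster-wide memory while the block below runs
ms = MemorySampler()
with ms.sample("in-memory anomalies"):
    ds[["anom_uu_mean", "anom_vv_mean", "anom_uv_mean"]].compute()

ms.plot(align=True)  # plot of sampled cluster memory over time (needs matplotlib)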

The matter becomes more complicated when the dataset is opened from a zarr store. We simply dumped our previously generated artificial data to a temporary store and reloaded it:

import dask.array as da
import numpy as np
import xarray as xr

ds = xr.Dataset(
    dict(
        anom_u=(["time", "face", "j", "i"], da.ones((10311, 1, 987, 1920), chunks=(10, 1, 987, 1920))),
        anom_v=(["time", "face", "j", "i"], da.ones((10311, 1, 987, 1920), chunks=(10, 1, 987, 1920))),
    )
)

store = "/work/scratch/test_zarr_graph"
ds.to_zarr(store, compute=False, mode="a")
ds = xr.open_zarr(store)

ds["anom_uu_mean"] = (["face", "j", "i"], np.mean(ds.anom_u.data**2, axis=0))
ds["anom_vv_mean"] = (["face", "j", "i"], np.mean(ds.anom_v.data**2, axis=0))
ds["anom_uv_mean"] = (["face", "j", "i"], np.mean(ds.anom_u.data * ds.anom_v.data, axis=0))

ds[["anom_uu_mean", "anom_vv_mean", "anom_uv_mean"]].compute()

[Figure: workers show a heavy memory load when the same computation reads from the zarr store]

I was expecting similar behavior between a dataset created from scratch and one opened from a zarr store, but this does not seem to be the case. I tried using inline_array=True with xr.open_dataset (see the sketch below), but to no avail. I also tried computing only 2 variables instead of 3, and that works properly, so the behavior seems strange to me.
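For reference, a minimal sketch of that inline_array attempt; the exact call used in the original run is an assumption, but inline_array is a documented xr.open_dataset parameter and chunks={} keeps the variables dask-backed with the on-disk chunking.

import xarray as xr

# inline_array=True inlines the backing array into every task that reads
# from it, instead of referencing it through a single key in the graph
ds = xr.open_dataset(
    "/work/scratch/test_zarr_graph",
    engine="zarr",
    chunks={},
    inline_array=True,
)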

Do you see any reason why I am seeing such a memory load on my workers?

Here are the software versions I use:

  • xarray: 2022.6.0rc0
  • dask: 2022.04.1
  • zarr: 2.11.1
  • numpy: 1.21.6

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 10 (4 by maintainers)

Top GitHub Comments

gjoseph92 commented, Jun 23, 2022 (1 reaction)

FYI @robin-cls, I would be a bit surprised if there is anything you can do on your end to fix things here with off-the-shelf dask. What @dcherian mentioned in https://github.com/dask/distributed/issues/6360#issuecomment-1129484190 is probably the only thing that might work. Otherwise you’ll need to run one of my experimental branches.

dcherian commented, Jun 23, 2022 (1 reaction)

Thanks @gjoseph92. I think there is a small but important increase in complexity here: we compute ds.u.mean(), ds.v.mean(), and (ds.u*ds.v).mean() all together, so each chunk of ds.u and ds.v is used for two different outputs.

IIUC, the example in https://github.com/dask/distributed/issues/6571 is basically just (ds.u * ds.v).mean() on its own.
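A minimal standalone sketch (not from the thread) of the dependency structure described above, using small arrays in place of the issue's data:

import dask
import dask.array as da

u = da.ones((100, 10), chunks=(10, 10))
v = da.ones((100, 10), chunks=(10, 10))

uu_mean = (u ** 2).mean(axis=0)
vv_mean = (v ** 2).mean(axis=0)
uv_mean = (u * v).mean(axis=0)

# Computed together, each chunk of `u` feeds both `uu_mean` and `uv_mean`,
# and each chunk of `v` feeds both `vv_mean` and `uv_mean`; the scheduler
# must keep every input chunk alive until both of its consumers have run.
dask.compute(uu_mean, vv_mean, uv_mean)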


