Large memory increase and processing slowness during graph creation
What happened:
When creating a graph consisting of delayed dataframes built from 20k image blocks, memory usage climbs above 10 gigabytes and graph construction takes several minutes. No execution of the graph happens at this point. I experimented with fewer blocks, and the scaling is quite non-linear. Processing time and memory used (measured via GKE JupyterHub pod memory usage) by number of blocks:
1000 => 3s
2000 => 9s
4000 => 28s, 0.6GB
8000 => 93s, 2.4GB
16000 => 324s, 9.1GB
What you expected to happen:
I expected graph creation for 20k blocks to fit under a minute and under a gigabyte of memory. I also expected performance to scale linearly with the number of blocks, since the blocks are independent (as can be verified by calling .visualize(..) on a single dataframe; see the snippet after the example below).
Minimal Complete Verifiable Example:
Scaled-down example with 8k blocks, which is enough to demonstrate the memory growth. My actual dataset has 20k blocks, roughly 3TB divided into 150MB blocks – but the problem seems to be purely a function of block count, not data size.
import dask
import dask.array as da
import dask.dataframe as dd
import numpy as np

# One-element chunks, so the number of blocks (not the data size) drives the cost.
image = da.zeros(8000, dtype=np.uint16, chunks=1)

# Pair each block index with the corresponding delayed block.
block_iter = zip(np.ndindex(*image.numblocks), image.to_delayed().flatten())

# Object array holding one dask dataframe per block.
ddf_all = np.empty(image.numblocks, dtype=object)
for idx_chunk, chunk in block_iter:
    # Wrap each block in a trivial delayed call and build a dataframe from it.
    ddf_delayed = dask.delayed(lambda x: None)(chunk)
    ddf_all[idx_chunk] = dd.from_delayed(ddf_delayed, meta=[("z", np.float32)])
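
For reference, one way to do the independence check mentioned above (a sketch; it assumes the optional graphviz dependency is installed, and the filename single_block_graph.svg is arbitrary) is to render the task graph of a single per-block dataframe:

# Render the task graph of one per-block dataframe for inspection.
single = ddf_all[(0,)]
single.visualize(filename="single_block_graph.svg")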
Environment:
- Dask version: 2021.06.2
- Python version: 3.8.10
- Operating System: Ubuntu 18.04
- Install method (conda, pip, source): conda

Ah, that makes sense: every from_delayed probably had to materialize and merge the full un-culled graph of the array. Nice! I had no idea I was fixing this.

The fix in #8174 by @gjoseph92 resolved this issue. In my testing, the 8000-block reproducer finished in <8s with minimal memory usage.
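
For anyone stuck on a release that predates the fix, here is a minimal sketch of a possible mitigation, assuming the cost really is the merged un-culled array graph described above; it is an illustration, not the change made in #8174. The idea is to cull each block's graph with dask.optimize before handing it to from_delayed, so each per-dataframe graph only contains the tasks it needs:

# Sketch of a possible workaround for older dask versions (an assumption, not the #8174 fix):
# dask.optimize returns equivalent collections whose graphs are culled down to the
# tasks each output actually needs, so from_delayed has less to materialize and merge.
for idx_chunk, chunk in zip(np.ndindex(*image.numblocks), image.to_delayed().flatten()):
    (chunk,) = dask.optimize(chunk)  # cull the per-block graph
    ddf_delayed = dask.delayed(lambda x: None)(chunk)
    ddf_all[idx_chunk] = dd.from_delayed(ddf_delayed, meta=[("z", np.float32)])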