Large memory increase and processing slowness during graph creation

What happened:

When creating a graph of delayed dataframes built from 20k image blocks, memory use climbs above 10 GB and graph creation takes several minutes. The graph is never executed. I experimented with fewer blocks, and the scaling is clearly non-linear. Processing time and memory use (measured via GKE JupyterHub pod memory usage) by number of blocks:

1000  =>   3s
2000  =>   9s
4000  =>  28s, 0.6GB
8000  =>  93s, 2.4GB
16000 => 324s, 9.1GB
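
As a rough check on how non-linear these numbers are, the growth exponent can be estimated from each doubling of the block count. The sketch below uses only the measurements reported above; the helper name `doubling_exponents` is made up for illustration. An exponent near 1 would mean linear scaling, and near 2 would mean quadratic.

```python
import math

# Measurements copied from the table above: block count -> (seconds, GB or None).
measurements = {
    1000: (3, None),
    2000: (9, None),
    4000: (28, 0.6),
    8000: (93, 2.4),
    16000: (324, 9.1),
}


def doubling_exponents(values):
    """Return log2(next / current) for values measured at successively doubled sizes."""
    return [math.log2(b / a) for a, b in zip(values, values[1:])]


counts = sorted(measurements)
times = [measurements[n][0] for n in counts]
memory = [measurements[n][1] for n in counts if measurements[n][1] is not None]

time_exponents = doubling_exponents(times)
memory_exponents = doubling_exponents(memory)

print("time exponents:  ", [round(e, 2) for e in time_exponents])
print("memory exponents:", [round(e, 2) for e in memory_exponents])
```

With these figures the time exponents fall between 1.5 and 2 and rise with block count, while the memory exponents sit close to 2, which suggests roughly quadratic growth rather than the expected linear behaviour.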

What you expected to happen:

I expected 20k blocks to build in under a minute and use under a gigabyte. I also expected performance to scale linearly with the number of blocks, since the blocks are independent (as can be verified with .visualize(..) on a single dataframe).

Minimal Complete Verifiable Example:

A scaled-down example with 8k blocks, which is enough to demonstrate the memory growth. My actual dataset has 20k blocks, roughly 3 TB divided into 150 MB blocks, but the problem seems to be purely a function of block count, not data size.

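import dask
import dask.array as da
import dask.dataframe as dd
import numpy as np

# One single-element chunk per block: 8000 blocks in total.
image = da.zeros(8000, dtype=np.uint16, chunks=1)
# Pair each block index with its delayed chunk.
block_iter = zip(np.ndindex(*image.numblocks), image.to_delayed().flatten())

# Build one dataframe per block. Only graphs are constructed; nothing is computed.
ddf_all = np.empty(image.numblocks, dtype=object)
for idx_chunk, chunk in block_iter:
    # Placeholder task; it never runs because the graph is not executed.
    ddf_delayed = dask.delayed(lambda x: None)(chunk)
    ddf_all[idx_chunk] = dd.from_delayed(ddf_delayed, meta=[("z", np.float32)])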

Environment:

  • Dask version: 2021.06.2
  • Python version: 3.8.10
  • Operating System: Ubuntu 18.04
  • Install method (conda, pip, source): conda

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 13 (11 by maintainers)

Top GitHub Comments

gjoseph92 commented, Sep 27, 2021

Ah that makes sense, every from_delayed probably had to materialize and merge the full un-culled graph of the array. Nice! I had no idea I was fixing this.
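
To see why that explanation would produce quadratic growth, here is a toy model in plain Python. It is not Dask's implementation: the graphs are ordinary dicts, and the function names and task tuples are invented for illustration. It only contrasts copying the whole shared graph into every per-block collection with keeping just the one task each block depends on.

```python
def make_shared_graph(n_blocks):
    """Toy stand-in for the array graph: one task per block."""
    return {("zeros", i): ("make_block", i) for i in range(n_blocks)}


def build_without_culling(n_blocks):
    """Each per-block collection carries a copy of the entire shared graph."""
    shared = make_shared_graph(n_blocks)
    collections = []
    for i in range(n_blocks):
        graph = dict(shared)  # copies all n_blocks tasks
        graph[("frame", i)] = ("to_frame", ("zeros", i))
        collections.append(graph)
    return collections


def build_with_culling(n_blocks):
    """Each per-block collection keeps only the task it depends on."""
    shared = make_shared_graph(n_blocks)
    collections = []
    for i in range(n_blocks):
        key = ("zeros", i)
        graph = {key: shared[key], ("frame", i): ("to_frame", key)}
        collections.append(graph)
    return collections


def total_tasks(collections):
    """Count the tasks stored across all collections."""
    return sum(len(graph) for graph in collections)


n = 200
print("without culling:", total_tasks(build_without_culling(n)))  # n * (n + 1)
print("with culling:   ", total_tasks(build_with_culling(n)))     # 2 * n
```

In this model the un-culled variant stores n * (n + 1) tasks in total against 2 * n for the culled one, so doubling the block count roughly quadruples the work and memory, in line with the measurements reported in the issue.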

chrisroat commented, Sep 27, 2021

#8174 by @gjoseph92 fixed this issue. In my testing, the 8000-block reproducer finished in under 8 s with minimal memory usage.
