
Memory errors on distributed dask cluster

See original GitHub issue

Description

I have a persisted dask dataframe that is larger than the memory available on my notebook server or on any individual worker. The data is x, y, z lidar points.

When I try to plot it with datashader, the whole dataframe seems to be transferred to the notebook during aggregation, before anything is plotted.

import dask.dataframe as dd   # assumes an existing dask.distributed Client named `client`
import datashader as ds

ddf = client.persist(dd.read_parquet('Some 20GB dataset'))  # ~20 GB of x, y, z lidar points
cvs = ds.Canvas(plot_width=900, plot_height=525)
agg = cvs.points(ddf, 'x', 'y', agg=ds.mean('z'))

This results in 20 GB of data being transferred to my notebook, which then gets killed by the OOM killer, as I only have 16 GB of RAM.
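
A quick sanity check (not part of the original report) to confirm that the persisted partitions actually live on the workers, and that only small results travel back to the notebook, reusing the `client` and `ddf` objects above:

# Mapping of task keys to the worker addresses holding them
print(client.who_has())

# Per-partition row counts; only this small result is sent back to the notebook
print(ddf.map_partitions(len).compute())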

Your environment

  • Datashader version: 0.6.8
  • Dask version: 0.20.0
  • Distributed version: 1.24.0

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 16 (7 by maintainers)

Top GitHub Comments

1 reaction
jonmmease commented, Jan 12, 2019

Hi @jacobtomlinson,

Wanted to let you know that I’m planning to take a look at this, as it’s definitely an important use case (and something that Datashader + Dask should be able to handle). Unfortunately, it probably won’t be until early February that I’ll have a compute/storage environment set up to reproduce what you’re seeing.

1 reaction
mrocklin commented, Dec 20, 2018

Groupby aggregations are computed by doing groupby aggregations on the partitions, then merging a few, doing more groupby aggregations on the merged results, and so on in a tree reduction until we get down to a final result. There is never much memory in any particular partition (assuming the number of groups is manageable).

As an example, we accomplish a groupby-mean by doing a groupby-sum and groupby-count on each partition, then doing a groupby-sum on both of those until we get down to one, then dividing the result on the final partition.
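
To make the sum/count decomposition concrete, here is a minimal sketch in plain pandas (not Dask internals); the two partitions and column names are made up for illustration:

import pandas as pd

# Two hypothetical partitions of the same logical dataframe
part1 = pd.DataFrame({'key': ['a', 'a', 'b'], 'z': [1.0, 3.0, 5.0]})
part2 = pd.DataFrame({'key': ['a', 'b', 'b'], 'z': [7.0, 9.0, 11.0]})

# Per-partition groupby-sum and groupby-count: small results, one row per group
partials = [p.groupby('key')['z'].agg(['sum', 'count']) for p in (part1, part2)]

# Merge step of the tree reduction: sum the partial sums and counts, divide once at the end
combined = pd.concat(partials).groupby(level=0).sum()
mean = combined['sum'] / combined['count']
print(mean)  # matches pd.concat([part1, part2]).groupby('key')['z'].mean()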

However, datashader does things differently than dask.dataframe. I’m not as familiar with its algorithms, but I suspect it does something similar.
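
This is not datashader’s actual code, but a rough NumPy sketch of the same partition-wise idea: each partition reduces to a canvas-sized weighted-sum array and a count array, and only those small arrays need to be combined, so the raw points would never have to leave their workers. The partition contents below are random data for illustration:

import numpy as np
import pandas as pd

def partial_agg(df, x_range, y_range, width=900, height=525):
    # Reduce one partition to two small (width x height) arrays on a fixed canvas grid
    bins, rng = (width, height), (x_range, y_range)
    wsum, _, _ = np.histogram2d(df['x'], df['y'], bins=bins, range=rng, weights=df['z'])
    count, _, _ = np.histogram2d(df['x'], df['y'], bins=bins, range=rng)
    return wsum, count

# Hypothetical partitions standing in for the pieces of the persisted dataframe
parts = [pd.DataFrame({'x': np.random.rand(1000),
                       'y': np.random.rand(1000),
                       'z': np.random.rand(1000)}) for _ in range(4)]
partials = [partial_agg(p, (0, 1), (0, 1)) for p in parts]

# Combining step: only canvas-sized arrays are summed, never the raw points
total_sum = sum(s for s, _ in partials)
total_count = sum(c for _, c in partials)
with np.errstate(invalid='ignore'):
    mean_z = total_sum / total_count  # NaN where no points landed in a pixel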

Read more comments on GitHub >

Top Results From Across the Web

Managing Memory — Dask.distributed 2022.12.1 documentation
Dask.distributed stores the results of tasks in the distributed memory of the worker nodes. The central scheduler tracks all data on the cluster...
Read more >
Possible memory leak when using LocalCluster #5960 - GitHub
What happened: Memory usage of code using da.from_array and compute in a for loop grows over time when using a LocalCluster .
Read more >
dask distributed memory error - Stack Overflow
This loads all of the data into RAM across the cluster (which is fine), and then tries to bring the entire result back...
Read more >
Tackling unmanaged memory with Dask - Coiled
Since distributed 2021.04.1, the Dask dashboard breaks down the memory usage of each worker and of the cluster total: In the graph we...
Read more >
Troubleshooting and Optimizing Dask Resources | Saturn Cloud
Dask Cluster Settings · Memory-related errors · If you use multiple workers but they aren't all running at high utilization · If you...
Read more >
