
Massive memory (100GB) used by dask-scheduler

See original GitHub issue

Using RAPIDS 0.14 with Dask 2.17.0, Python 3.6, conda, on Ubuntu 16.04.

I’m running XGBoost with Dask on GPUs. The steps are:

  1. Convert the in-memory numpy frame to a Dask distributed frame using from_array().
  2. Chunk the frames so that every worker (here 3 nodes, 2 GPUs per node) has data as required; otherwise xgboost hangs.
  3. Run on a dataset of roughly 5M rows x 10 columns of airlines data (a minimal sketch of this flow follows the list).
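
For context, a minimal sketch of roughly this flow using dask.array.from_array and the xgboost.dask API. The scheduler address, data sizes, chunking, and training parameters below are placeholders, not values from the original report:

```python
import numpy as np
import dask.array as da
import xgboost as xgb
from dask.distributed import Client

# Hypothetical data standing in for the airlines dataset
n_rows, n_cols = 5_000_000, 10
X_np = np.random.rand(n_rows, n_cols).astype(np.float32)
y_np = np.random.rand(n_rows).astype(np.float32)

with Client("tcp://scheduler-address:8786") as client:  # hypothetical address
    # Chunk rows so each of the 6 workers (3 nodes x 2 GPUs) gets a partition;
    # xgboost.dask can hang if some workers receive no data
    chunk = n_rows // 6
    X = da.from_array(X_np, chunks=(chunk, n_cols))
    y = da.from_array(y_np, chunks=(chunk,))

    dtrain = xgb.dask.DaskDMatrix(client, X, y)
    result = xgb.dask.train(
        client,
        {"tree_method": "gpu_hist", "objective": "reg:squarederror"},
        dtrain,
        num_boost_round=100,
    )
    booster = result["booster"]
```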

Notes:

  1. Every time steps 1-3 run, they run in an isolated fork that dies at the end of the fit, so any client instances and the like are destroyed with it. Nothing remains on the GPU and nothing remains in the process, since the process is gone. So I don’t believe I need a client.close() call.
  2. Even though these forks are gone, within the code I always use the client as a context manager in a with statement, so again I shouldn’t need a client.close() call or anything like that (see the sketch after this list).
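
A rough illustration of that pattern; the fork-per-fit wrapper and scheduler address here are hypothetical and only meant to show a context-managed client inside a short-lived process:

```python
import multiprocessing as mp
from dask.distributed import Client

def fit_once(scheduler_addr):
    # The client is a context manager, so it is closed when the block exits;
    # the forked process then exits, so no client or GPU state should outlive the fit.
    with Client(scheduler_addr) as client:
        pass  # build the dask arrays and run the xgboost.dask fit here

if __name__ == "__main__":
    ctx = mp.get_context("fork")
    p = ctx.Process(target=fit_once, args=("tcp://scheduler-address:8786",))
    p.start()
    p.join()
```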

My application uses a reasonable amount of memory for this dataset, and the workers use very little memory, around 2-3%.

However, the dask-scheduler is using 70% of my 128GB system! I don’t understand how or why, since as far as I understand the scheduler shouldn’t hold data. Perhaps the above sequence of handing the dask frame to xgboost is the problem, but it would be odd for the task graph to be forced to hold data.

Even if a single graph held data, which would already be a problem, there’s no way 90GB is needed to hold the data involved, so it looks as though old data is being retained repeatedly.
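
One way to check whether graph state is piling up between fits is to ask the scheduler directly. This is just a diagnostic sketch (scheduler address assumed), not something from the original report:

```python
from dask.distributed import Client

# Run this between fits: if the count keeps growing, the scheduler is
# accumulating task/graph state rather than releasing it.
with Client("tcp://scheduler-address:8786") as client:  # hypothetical address
    n_tasks = client.run_on_scheduler(
        lambda dask_scheduler: len(dask_scheduler.tasks)
    )
    print("tasks currently tracked by the scheduler:", n_tasks)
```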

[Screenshot: dask-scheduler process memory usage]

I don’t have code to share for a repro since it’s not easy to extract, but I’m still hoping for ideas. I will work on a repro, but any fixes or ideas would be welcome.

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 31 (14 by maintainers)

Top GitHub Comments

trivialfis commented, Nov 20, 2020 (1 reaction)

Those are useful references. I think it might be best for xgboost to list some of these discussions in its doc.

martindurant commented, Mar 19, 2021 (0 reactions)

Following the appearance of high-level graphs in the scheduler, there is a good chance that this issue has been mitigated. I suggest closing.
