
Massive memory (100GB) used by dask-scheduler

See original GitHub issue

Using RAPIDS 0.14 with Dask 2.17.0, Python 3.6, conda, on Ubuntu 16.04.

I’m running XGBoost with Dask on GPUs. The steps are:

  1. Convert the in-memory numpy frame to a Dask distributed frame using from_array().
  2. Chunk the frames so that every worker (here 3 nodes, 2 GPUs per node) has data as required; otherwise xgboost hangs.
  3. Run on a dataset of roughly 5M rows x 10 columns of airlines data (a minimal sketch of this flow follows the list).
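
For context, a minimal sketch of roughly this flow using dask.array.from_array and the xgboost.dask API. The scheduler address, data sizes, chunking, and training parameters below are placeholders, not values from the original report:

```python
import numpy as np
import dask.array as da
import xgboost as xgb
from dask.distributed import Client

# Hypothetical data standing in for the airlines dataset
n_rows, n_cols = 5_000_000, 10
X_np = np.random.rand(n_rows, n_cols).astype(np.float32)
y_np = np.random.rand(n_rows).astype(np.float32)

with Client("tcp://scheduler-address:8786") as client:  # hypothetical address
    # Chunk rows so each of the 6 workers (3 nodes x 2 GPUs) gets a partition;
    # xgboost.dask can hang if some workers receive no data
    chunk = n_rows // 6
    X = da.from_array(X_np, chunks=(chunk, n_cols))
    y = da.from_array(y_np, chunks=(chunk,))

    dtrain = xgb.dask.DaskDMatrix(client, X, y)
    result = xgb.dask.train(
        client,
        {"tree_method": "gpu_hist", "objective": "reg:squarederror"},
        dtrain,
        num_boost_round=100,
    )
    booster = result["booster"]
```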

Notes:

  1. Every time steps 1-3 run, they run in an isolated fork that dies at the end of the fit, so any client instances and the like are destroyed with it. Nothing remains on the GPU and nothing remains in the process, since the process is gone. So I don’t believe I need a client.close() call.
  2. Even though these forks are gone, within the code I always use the client as a context manager in a with statement, so again I shouldn’t need a client.close() call or anything like that (see the sketch after this list).
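
A rough illustration of that pattern; the fork-per-fit wrapper and scheduler address here are hypothetical and only meant to show a context-managed client inside a short-lived process:

```python
import multiprocessing as mp
from dask.distributed import Client

def fit_once(scheduler_addr):
    # The client is a context manager, so it is closed when the block exits;
    # the forked process then exits, so no client or GPU state should outlive the fit.
    with Client(scheduler_addr) as client:
        pass  # build the dask arrays and run the xgboost.dask fit here

if __name__ == "__main__":
    ctx = mp.get_context("fork")
    p = ctx.Process(target=fit_once, args=("tcp://scheduler-address:8786",))
    p.start()
    p.join()
```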

My application uses a reasonable amount of memory for this dataset, and the workers use very little memory, around 2-3%.

However, the dask-scheduler is using 70% of my 128GB system! I don’t understand how or why, since as far as I understand the scheduler shouldn’t hold data. Perhaps the above sequence of handing the dask frame to xgboost is the problem, but it would be odd for the task graph to be forced to hold data.

Even if a single graph held data, which would already be a problem, there’s no way 90GB is needed to hold the data involved, so it looks as though old data is being retained repeatedly.
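
One way to check whether graph state is piling up between fits is to ask the scheduler directly. This is just a diagnostic sketch (scheduler address assumed), not something from the original report:

```python
from dask.distributed import Client

# Run this between fits: if the count keeps growing, the scheduler is
# accumulating task/graph state rather than releasing it.
with Client("tcp://scheduler-address:8786") as client:  # hypothetical address
    n_tasks = client.run_on_scheduler(
        lambda dask_scheduler: len(dask_scheduler.tasks)
    )
    print("tasks currently tracked by the scheduler:", n_tasks)
```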

[Screenshot: dask-scheduler process memory usage]

I don’t have code to share for a repro since it’s not easy to extract, but I’m still hoping for ideas. I will work on a repro, but any fixes or ideas would be welcome.

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 31 (14 by maintainers)

Top GitHub Comments

trivialfis commented, Nov 20, 2020 (1 reaction)

Those are useful references. I think it might be best for xgboost to list some of these discussions in its doc.

martindurant commented, Mar 19, 2021 (0 reactions)

Following the appearance of high-level graphs in the scheduler, there is a good chance that this issue has been mitigated. I suggest closing.
