Massive memory (100GB) used by dask-scheduler
Using rapids 0.14 with dask 2.17.0, Python 3.6, conda, Ubuntu 16.04.
I’m running xgboost using dask on GPUs. I do:
1. Convert the in-memory numpy frame to a dask distributed frame using `from_array()`.
2. Chunk the frames so that every worker (here 3 nodes, 2 GPUs per node) receives data; otherwise xgboost hangs.
3. Run a dataset of roughly 5M rows x 10 columns of airlines data (a rough sketch of these steps follows below).
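For concreteness, a minimal sketch of those three steps, assuming the `xgboost.dask` interface; `fit_once`, `X_np`, `y_np`, the worker count, and the scheduler address are placeholders, not the actual application code:

```python
import dask.array as da
import xgboost as xgb
from dask.distributed import Client

def fit_once(scheduler_address, X_np, y_np, n_workers=6):
    with Client(scheduler_address) as client:
        # step 1: numpy -> dask via from_array()
        # step 2: chunk so each of the n_workers gets at least one partition
        rows_per_chunk = max(1, X_np.shape[0] // n_workers)
        X = da.from_array(X_np, chunks=(rows_per_chunk, X_np.shape[1]))
        y = da.from_array(y_np, chunks=(rows_per_chunk,))

        # step 3: hand the collections to xgboost's dask interface and train on GPUs
        dtrain = xgb.dask.DaskDMatrix(client, X, y)
        result = xgb.dask.train(
            client,
            {"tree_method": "gpu_hist"},
            dtrain,
            num_boost_round=100,
        )
        return result["booster"]
```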
Notes:
- Every time steps 1-3 are done, they run in an isolated fork that dies at the end of the fit, so any Client instances etc. are destroyed. Nothing remains on the GPU, and nothing remains in a process since the process is gone. So I don't believe I need a client.close() call (a sketch of this pattern follows the notes).
- Even though these forks are gone, within the code I always use the client as a context manager inside a `with` statement. So again, I shouldn't need a client.close() call or anything like that.
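A hypothetical sketch of the "isolated fork per fit" pattern from the first note; the real application's process management is not shown in this issue:

```python
import multiprocessing as mp

def _fit_in_child(scheduler_address, X_np, y_np):
    # Everything here lives only in the child process: when it exits,
    # the Client, the dask collections, and any GPU state die with it,
    # so no explicit client.close() is issued beyond the context manager.
    booster = fit_once(scheduler_address, X_np, y_np)  # see the sketch above
    booster.save_model("xgb_model.bin")  # placeholder path

def fit_isolated(scheduler_address, X_np, y_np):
    # On Linux the default start method is fork, matching the description above.
    p = mp.Process(target=_fit_in_child, args=(scheduler_address, X_np, y_np))
    p.start()
    p.join()
```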
I see my application using a reasonable amount of memory for that dataset, and the workers using very little memory, around 2-3%.
However, the dask-scheduler is using 70% of my 128GB system! I don't understand how or why, since as far as I understand the scheduler shouldn't hold data. Perhaps the above sequence of sending the dask frame to xgboost is the problem, but it would be odd for the task graph to be forced to hold data.
Even if a single graph held data, which would already be a problem, there's no way ~90GB is needed to hold the data involved, so it looks like old data is repeatedly being retained.
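One way to confirm where the memory actually sits is to ask the scheduler process for its own resident set size. A rough diagnostic sketch, assuming psutil is installed on the scheduler host:

```python
from dask.distributed import Client

def scheduler_rss_gb(client: Client) -> float:
    # Run a small function on the scheduler that reports its own RSS.
    def _rss():
        import os
        import psutil
        return psutil.Process(os.getpid()).memory_info().rss
    return client.run_on_scheduler(_rss) / 1e9

# Usage (address is a placeholder):
# with Client("tcp://scheduler:8786") as client:
#     print(f"scheduler RSS: {scheduler_rss_gb(client):.1f} GB")
```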
I don't have repro code to share since it's not easy to extract, but I'm still hoping for ideas. I will work on a repro, but any fixes or ideas in the meantime would be welcome.
Those are useful references. I think it might be best for xgboost to list some of these discussions in its doc.
Following the appearance of high-level graphs in the scheduler, there is a good chance that this issue has been mitigated. I suggest closing.
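For anyone wanting to verify on the client side, a quick sketch for checking that a collection carries a HighLevelGraph (separate from whether the scheduler unpacks it as one):

```python
import numpy as np
import dask.array as da
from dask.highlevelgraph import HighLevelGraph

x = da.from_array(np.ones((1000, 10)), chunks=(100, 10))
graph = x.__dask_graph__()
print(isinstance(graph, HighLevelGraph))  # True on recent dask versions
print(len(graph.layers))  # number of high-level layers, not materialized tasks
```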