
Dask workers spilling to disk too early

See original GitHub issue

What happened:

I have a LocalCluster with ~50 workers. I use pure Dask DataFrame operations to create a dataframe with a large task graph (~3 million tasks).

When this dataframe is submitted to the LocalCluster (e.g. ddf.to_csv()), the compute finishes successfully; however, at no point does RAM usage go above ~25% (the highest I have seen an individual worker go is ~35%). When I monitor the file system, I can see that dask-worker-space gets very large (>100GB).

I believe this is costing me a lot of performance due to de/serialisation overhead (I see tasks in the dashboard named e.g. disk-write-read-csv…).

I have set the Dask config as below (terminate at >100% because my data is skewed across workers; a single worker breaking 100% isn't an issue):

distributed:
  worker:
    memory:
      target: 0.7
      spill: 0.9
      pause: 1.0
      terminate: 1.2
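
For reference, the same thresholds can also be set programmatically before the cluster is created; a minimal sketch using dask.config.set (the cluster sizes below are just illustrative values):

import dask
from distributed import Client, LocalCluster

# Set the worker memory thresholds before the workers are created so they
# pick the values up at startup; the keys mirror the YAML config above.
dask.config.set({
    "distributed.worker.memory.target": 0.7,     # spill based on sizeof-tracked data
    "distributed.worker.memory.spill": 0.9,      # spill based on OS-reported (psutil) memory
    "distributed.worker.memory.pause": 1.0,      # pause accepting new tasks
    "distributed.worker.memory.terminate": 1.2,  # restart the worker
})

cluster = LocalCluster(n_workers=50, threads_per_worker=1)  # illustrative sizes
client = Client(cluster)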

While the cluster is computing, I have attached a new client to it and run checks that the workers have been configured correctly (it appears they have):


# returns 0.7 as expected
client.run(lambda dask_worker: dask_worker.memory_target_fraction)

# returns 0.9 as expected
client.run(lambda dask_worker: dask_worker.memory_spill_fraction)

# returns 1.0 as expected
client.run(lambda dask_worker: dask_worker.memory_pause_fraction)

I have DEBUG logs turned on, and also do not see any lines indicating spill-to-disk has started as per https://github.com/dask/distributed/blob/6a32208717c04aa9ccb4defdf253b8a00cab32b7/distributed/worker.py#L2667-L2669
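
For anyone trying to reproduce this, a minimal way to raise the worker logger to DEBUG from an attached client is a sketch like the following (standard library logging; the logger name assumes the messages come from distributed/worker.py, and depending on how the worker's log handlers are configured you may also need to lower the handler level):

import logging

def enable_worker_debug_logs():
    # Raise the distributed.worker logger to DEBUG on each worker process so
    # that any spill-to-disk messages show up in the worker logs.
    logging.getLogger("distributed.worker").setLevel(logging.DEBUG)
    return True

client.run(enable_worker_debug_logs)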

If I set worker.memory.target to False, I continue to get low RAM usage (~25% per worker) and large dask-worker-space.

If I set worker.memory.spill to False, I continue to get low RAM usage (~25% per worker, apart from 1 of my workers which is at ~50%) and large dask-worker-space.

If I set worker.memory.target and worker.memory.spill to False, I see dask-worker-space remains empty and RAM usage steadily climbs. This unfortunately is not a solution for me, as my data doesn’t wholly fit into memory, so this will eventually fail.
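
For reference, that last experiment corresponds to a config along these lines (a sketch; false disables the corresponding check):

distributed:
  worker:
    memory:
      target: false   # disable sizeof-based spilling
      spill: false    # disable psutil-based spilling
      pause: 1.0
      terminate: 1.2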

What you expected to happen:

  • My workers to wait until they are at ~70% RAM util before dask-worker-space starts filling up.

Minimal Complete Verifiable Example:

I think a colleague of mine has one using dask.datasets.timeseries; I will ask him to share it here.

# Put your MCVE code here

Anything else we need to know?:

Environment:

  • Dask version: 2.23.0 (seen in 2.19.0 also, possibly earlier)
  • Python version: 3.7.7
  • Operating System: Linux
  • Install method (conda, pip, source): conda

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 17 (14 by maintainers)

Top GitHub Comments

1 reaction
kylejn27 commented, Aug 21, 2020

MCVE

Most of the time when I run this, my workers get up to 25% memory and then I see lots of disk-write-* tasks. I have all default configurations.

I tried to create a gif of my workers; it showed the issue, but the gif was too large to upload. When I tried to recreate the gif, my workers pushed to 60% capacity and then spilled to disk as expected, so I'm not quite sure what happened.

from distributed import Client
import dask

client = Client(threads_per_worker=1, n_workers=4)
# Workers: 4
# Cores: 4
# Memory: 17.18 GB

df = dask.datasets.timeseries(
    start="2000-01-01",
    end="2010-01-31",
    freq="1s",
    partition_freq="1d",
    dtypes={"name": int, "id": int, "x": float, "y": float, "a": int, "b": int},
)

df.memory_usage(index=True).sum().compute() / 1e9
# 17.8199272 GB

df = df.persist()  # keep a reference so the persisted data stays in cluster memory
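
While that runs, one way to watch how the two memory measures diverge from an attached client is a sketch like this (psutil is already a dependency of distributed; the dask_worker.nbytes mapping of per-key sizes is an assumption about this version's Worker internals):

import psutil

def memory_report(dask_worker):
    # "Managed" memory as the worker tracks it (sizeof-based estimate per key;
    # dask_worker.nbytes is assumed to exist in this distributed version) versus
    # the resident set size the OS reports for the worker process.
    managed = sum(dask_worker.nbytes.values())
    rss = psutil.Process().memory_info().rss
    return {"managed_gb": managed / 1e9, "rss_gb": rss / 1e9}

client.run(memory_report)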

Environment info:

dask                      2.22.0                     py_0    conda-forge
dask-core                 2.22.0                     py_0    conda-forge
distributed               2.22.0           py37hc8dfbb8_0    conda-forge
pandas                    1.0.5            py37h94625e5_0    conda-forge
numpy                     1.18.5           py37h7687784_0    conda-forge

EDIT: OK, I took a gif of my gif and made it smaller. Hopefully this demonstrates what I'm seeing well enough. The gif doesn't show it, but memory use bounces between 20% and 25% on each worker for a while; then the disk-write-* tasks show up.

(gif attachment of the workers' memory use, not reproduced here)

0 reactions
KrishanBhasin commented, Oct 13, 2020

In summary:

  • It appears that the Workers’ two methods of identifying memory use have been strongly disagreeing in my use case. dask.sizeof() seems to massively overestimate memory use compared to psutil.
  • I solved my issue by setting memory.target to False in the configuration file, relying on psutil’s memory monitoring to trigger a spill-to-disk at my memory.spill target.
    • This initially did not work due to how False was propagating through to Zict and how it decided whether to spill-to-disk (it was spilling everything immediately), which has since been resolved via #4156

I am not clear on why dask.sizeof() is massively overestimating my memory use, and am not able to dedicate time to investigating this further at the moment. I suspect it may be related to my use of columns containing the object dtype to store strings.
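
One way to see how per-object accounting can drift away from what the OS actually allocates is to build a frame whose object column mostly holds repeated references to the same strings (a minimal, self-contained sketch; this illustrates one possible mechanism and is not necessarily what is happening with my data):

import pandas as pd
import psutil
from dask.sizeof import sizeof  # the estimator behind the sizeof-based accounting

proc = psutil.Process()
rss_before = proc.memory_info().rss

# One million rows but only 1,000 distinct string objects: the object column
# mostly stores repeated references, so the OS allocates far less memory than
# per-element accounting would suggest.
distinct = ["value-%d" % i for i in range(1000)]
df = pd.DataFrame({
    "s": [distinct[i % 1000] for i in range(1_000_000)],
    "x": range(1_000_000),
})
rss_after = proc.memory_info().rss

print("dask.sizeof estimate:           %.1f MB" % (sizeof(df) / 1e6))
print("pandas memory_usage(deep=True): %.1f MB" % (df.memory_usage(deep=True).sum() / 1e6))
print("process RSS growth (psutil):    %.1f MB" % ((rss_after - rss_before) / 1e6))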

I’m going to close this issue as it has been (mostly) understood, and I hope this summary helps any future users of Dask/Distributed who encounter similar behaviour!

(if closing the issue is the wrong action, please feel free to reopen)

Read more comments on GitHub

Top Results From Across the Web

  • Worker Memory Management - Dask.distributed: Because Dask makes decisions (spill-to-disk, pause, terminate, rebalance()) based on the worker's memory usage as reported by the OS, and is unaware...
  • How can I force a dask worker to not write data to disk?: What you're doing should work fine. I'm surprised that it doesn't for you. The official recommendation is to use false values, ...
  • Reducing memory usage in Dask workloads by 80% - Coiled: This is mostly because they no longer load too much data into memory and then have to spill it to disk, which creates...
  • calculations spill to disk? - dask - Zulip Chat Archive: Using 36 dask workers (375GB memory), I am unable to perform ... That didn't actually work for me - I was still spilling to...
  • An Introduction to Memory Spilling - Mads R. B. Kristensen: Memory spilling is an important feature that makes it possible to run Dask applications that would otherwise run out of memory.
