Spill to disk may cause data duplication
In aggressive spill-to-disk scenarios I observed that distributed may spill all the data it has in memory while still complaining, with the following message, that there is no more data to spill:
"Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory?"
Side note: in our setup every worker runs in an isolated container, so the chance of another process interfering with it is virtually zero.
It is true that these workers do indeed still hold on to a lot of memory, without it being very transparent where that data is. Running GC does not help, i.e. this is not related to https://github.com/dask/zict/issues/19
Investigating this issue led me to realise that we may try to spill data which is actually currently in use. The most common example for this is probably once the worker has collected the dependencies and schedules the execution of the task, see here. Another one would be when data is requested from a worker and we’re still serializing/submitting it. I could verify this assumption by patching the worker code and tracking the keys in execution/in spilling with some logging; it turns out that for some jobs the worker holds on to multiple GBs of memory although the data was supposedly already spilled.
If we spill data which is actually still in use, this is not only misleading to the user, since the data is still in memory, but it may also cause heavy data duplication: if the spilled-but-still-in-memory dependency is requested by another worker, the buffer will fetch the key from the slow store and materialise it a second time, because it no longer knows about the original object that is still in memory. If this piece of data is requested by more than one worker, it could even be duplicated multiple times.
In non-aggressive spill-to-disk scenarios we should be protected from this by the LRU in the buffer, but if memory pressure is high, the spilling might actually worsen the situation in these cases.
My practical approach to this would be to introduce something like a (un)lock_key_in_store method on the buffer which protects a key from spilling, and to manually (un)set this in the distributed code; a rough sketch of what I have in mind follows below. If there is a smarter approach, I’d be glad to hear about it.
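A minimal sketch of what such a locking layer could look like. The names `LockableSpillBuffer`, `lock_key_in_store`/`unlock_key_in_store` and all internals are hypothetical illustrations of the proposal, not the actual `zict.Buffer` API:

```python
from collections import OrderedDict


class LockableSpillBuffer:
    """Toy fast/slow mapping that never spills keys marked as in use (illustrative only)."""

    def __init__(self, slow, target_bytes, sizeof=len):
        self.fast = OrderedDict()      # key -> in-memory value, kept in LRU order
        self.slow = slow               # e.g. a File-like mapping that writes to disk
        self.target_bytes = target_bytes
        self.sizeof = sizeof
        self._locked = set()           # keys currently in use by execute/get_data

    def lock_key_in_store(self, key):
        self._locked.add(key)

    def unlock_key_in_store(self, key):
        self._locked.discard(key)

    def __setitem__(self, key, value):
        self.fast[key] = value
        self._maybe_spill()

    def __getitem__(self, key):
        if key in self.fast:
            self.fast.move_to_end(key)
            return self.fast[key]
        value = self.slow[key]         # unspill: materialises a new object
        self.fast[key] = value
        self._maybe_spill()
        return value

    def _maybe_spill(self):
        used = sum(self.sizeof(v) for v in self.fast.values())
        for key in list(self.fast):
            if used <= self.target_bytes:
                break
            if key in self._locked:    # skip keys that are still in use
                continue
            value = self.fast.pop(key)
            self.slow[key] = value
            used -= self.sizeof(value)
```

The worker would then wrap the dependency-holding part of task execution and the serialization path of get_data in lock/unlock calls, so the buffer never evicts exactly those keys that are guaranteed to stay in memory anyway.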
Also, if my reasoning is flawed somewhere, I’d appreciate feedback: so far I could only prove that we try to spill data currently in use; the duplication itself is still just theory.
Related issues: https://github.com/dask/dask/issues/2456
Top GitHub Comments
A very naive worst-case calculation of how severe this issue is.
We’re targeting `get_data` requests of about 50MB (`Worker.target_message_size`). Let’s take this as a reference size for a key’s size; in realistic workloads I wouldn’t be surprised to see much larger keys. The original report here described a shuffle for which 50MB splits are reasonable (input 2GB / 32 splits ~ 62MB).

The worker limits concurrent incoming `get_data` requests to `Worker.total_in_connections == 10` (default) but allows this to double for same-host connections, i.e. `#key_size * #total_in_connections * #same_host_mult`. That yields 500-1000MB of data duplication for `get_data`, assuming the LRU is missed consistently, which I assume is approximately true for a shuffle.

The duplication on the `Worker.execute` side scales linearly with the number of threads and the number of input dependencies, i.e. `#key_size * #dependencies * #num_threads`. Again using 50MB per key and assuming 32 dependencies for a shuffle, that yields 1600MB per thread.

Let’s have a look at a worst case: all workers on the same machine, max branching of 32 in a shuffle, key size at 50MB and 2 threads per worker (the two is pretty arbitrary, I have to admit). That’s up to 84 duplications (10 * 2 + 32 * 2) of a key’s data, or 4200MB of unmanaged memory caused by our spillage layer. Various data copies and the overhead of the UDF are on top of that.

I guess this worst case is not too far off from a LocalCluster “deployment” on a big VM. However, most of the numbers are mere guesses and this can easily be scaled up or down, so please take this with a healthy pinch of salt. Either way, duplicates also mean duplicate effort in deserialization and disk IO, so there is a case to be made for avoiding them regardless of the memory footprint.
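As a sanity check, the same back-of-the-envelope numbers in a few lines of Python; all inputs are the guesses stated above, not measured values:

```python
# Worst-case duplication estimate using the guessed inputs from the comment above.
key_size_mb = 50                 # ~ Worker.target_message_size as a reference key size
total_in_connections = 10        # Worker.total_in_connections default
same_host_multiplier = 2         # limit doubles for same-host connections
dependencies = 32                # shuffle branching factor
threads_per_worker = 2           # arbitrary choice, as noted above

get_data_copies = total_in_connections * same_host_multiplier   # 20 copies
execute_copies = dependencies * threads_per_worker               # 64 copies
total_copies = get_data_copies + execute_copies                  # 84 copies

print(f"get_data duplication: {get_data_copies * key_size_mb} MB")   # 1000 MB
print(f"execute duplication:  {execute_copies * key_size_mb} MB")    # 3200 MB
print(f"worst case total:     {total_copies * key_size_mb} MB")      # 4200 MB
```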
From my understanding the zero-copy fix doesn’t help, since the duplication I am referring to here is not caused by memcopies but rather by having distinct Python objects. I’ll try to explain with a simplified pseudo worker/zict buffer.
Let’s assume, for simplicity, that the data we’re concerned about is large and will be spilled by the buffer immediately because it is beyond the threshold `memory_target = memory_limit * memory_target_fraction`. In this case, whenever the data is accessed, the buffer will read it from storage and create a new data instance without storing a reference to it in its data. This instance is returned to the caller, and every subsequent call to the buffer will do the exact same thing: load data, create a new instance, return it without holding a reference.

In this extreme situation this is hopefully easy to follow. In more realistic scenarios we’re spilling data concurrently with other operations. For instance, data is being used as part of an execute, but the buffer spills it. This results in the data no longer being tracked by the buffer (in our new terminology this means the data is “unmanaged memory”), but it is still in memory since it is being used by the task execution. If another worker then requests this piece of data we just spilled, the buffer will load it from the store, create a new instance and start serialization. Simultaneously, X other workers could do the same, resulting in X (get_data) + 1 (executing) copies of the same data, because our buffer doesn’t know about the keys still being in use.
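A runnable toy version of this scenario; the `ToyBuffer` class below is a deliberately simplified stand-in and does not correspond to the real Worker/zict code:

```python
import os
import pickle
import tempfile


class ToyBuffer:
    """Fast dict + pickle-on-disk slow store; no LRU, spilling is triggered explicitly."""

    def __init__(self, directory):
        self.fast = {}
        self.directory = directory

    def __setitem__(self, key, value):
        self.fast[key] = value

    def spill(self, key):
        value = self.fast.pop(key)                      # buffer drops its reference
        with open(os.path.join(self.directory, key), "wb") as f:
            pickle.dump(value, f)

    def __getitem__(self, key):
        if key in self.fast:
            return self.fast[key]
        with open(os.path.join(self.directory, key), "rb") as f:
            return pickle.load(f)                       # a brand new object on every call


with tempfile.TemporaryDirectory() as tmp:
    buffer = ToyBuffer(tmp)
    buffer["x"] = bytearray(2**20)      # pretend this is a large dependency

    data_in_use = buffer["x"]           # held by a running task (execute)
    buffer.spill("x")                   # memory pressure: the buffer spills "x" anyway

    copy_1 = buffer["x"]                # get_data for worker A -> new object from disk
    copy_2 = buffer["x"]                # get_data for worker B -> yet another object

    # Three distinct objects holding the same data are now alive simultaneously,
    # only one of which is tracked ("managed") by the buffer's slow store.
    assert data_in_use is not copy_1 and copy_1 is not copy_2
```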
While typing this, I’m wondering whether we couldn’t fix this by being a bit smart with weakrefs, but I currently don’t have time to test this idea; a rough sketch of what I mean is below.
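An untested sketch of the weakref idea, assuming the buffer keeps weak references to the values it hands out and reuses a still-alive object instead of deserialising a fresh copy. Note that many built-in payload types (bytes, list, dict, ...) cannot be weak-referenced, which is one reason this would need prototyping rather than being a drop-in fix:

```python
import weakref


class WeakrefAwareBuffer:
    """Hypothetical spill buffer that remembers handed-out values via weak references."""

    def __init__(self, fast, slow):
        self.fast = fast                                  # in-memory mapping
        self.slow = slow                                  # on-disk mapping
        self._handed_out = weakref.WeakValueDictionary()  # key -> still-alive value

    def __getitem__(self, key):
        if key in self.fast:
            value = self.fast[key]
        else:
            # If a task or get_data request still holds the object we previously
            # returned, reuse it instead of deserialising another copy from disk.
            value = self._handed_out.get(key)
            if value is None:
                value = self.slow[key]
        try:
            self._handed_out[key] = value
        except TypeError:
            pass          # value type does not support weak references
        return value
```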
Edit: Just in case I haven’t made myself clear yet: I believe this issue is caused by the `zict.Buffer`, not by our serialization protocol or anything else distributed is doing. This issue might be better off in the zict repo, but I believe the visibility here is more important (and we have “our own” `Buffer` subclass by now anyway: https://github.com/dask/distributed/blob/8c73a18ba5c2ffe61977ea936da8ffeacb815c61/distributed/spill.py#L16)