Spill to disk may cause data duplication
In aggressive spill-to-disk scenarios I observed that distributed may spill all the data it has in memory while still complaining, with the following message, that there is no more data to spill:
"Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory?"
Side note: in our setup every worker runs in an isolated container, so the chance of another process interfering with it is virtually zero.
It is true that these workers do indeed still hold on to a lot of memory, without it being very transparent where that data is. Running GC does not help, i.e. this is not related to https://github.com/dask/zict/issues/19
Investigating this issue led me to realise that we may try to spill data which is actually currently in use. The most common example for this is probably once the worker has collected the dependencies and schedules the execution of the task, see here. Another one would be when data is requested from a worker and we’re still serializing/submitting it. I could verify this assumption by patching the worker code and tracking the keys in execution/in spilling with some logging; it turns out that for some jobs the worker holds on to multiple GBs of memory although the data was supposedly already spilled.
If we spill data which is actually still in use, this is not only misleading to the user, since the data is still in memory, but it may also cause heavy data duplication: if the spilled-but-still-in-memory dependency is requested by another worker, the buffer will fetch the key from the slow store and materialise it a second time, because it no longer knows about the original object that is still in memory. If this piece of data is requested by more than one worker, it could even be duplicated multiple times.
In non-aggressive spill-to-disk scenarios we should be protected from this by the LRU in the buffer, but if memory pressure is high, the spilling might actually worsen the situation in these cases.
My practical approach to this would be to introduce something like a (un)lock_key_in_store method on the buffer which protects a key from spilling, and to manually (un)set this in the distributed code; a rough sketch of what I have in mind follows below. If there is a smarter approach, I’d be glad to hear about it.
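A minimal sketch of what such a locking layer could look like. The names `LockableSpillBuffer`, `lock_key_in_store`/`unlock_key_in_store` and all internals are hypothetical illustrations of the proposal, not the actual `zict.Buffer` API:

```python
from collections import OrderedDict


class LockableSpillBuffer:
    """Toy fast/slow mapping that never spills keys marked as in use (illustrative only)."""

    def __init__(self, slow, target_bytes, sizeof=len):
        self.fast = OrderedDict()      # key -> in-memory value, kept in LRU order
        self.slow = slow               # e.g. a File-like mapping that writes to disk
        self.target_bytes = target_bytes
        self.sizeof = sizeof
        self._locked = set()           # keys currently in use by execute/get_data

    def lock_key_in_store(self, key):
        self._locked.add(key)

    def unlock_key_in_store(self, key):
        self._locked.discard(key)

    def __setitem__(self, key, value):
        self.fast[key] = value
        self._maybe_spill()

    def __getitem__(self, key):
        if key in self.fast:
            self.fast.move_to_end(key)
            return self.fast[key]
        value = self.slow[key]         # unspill: materialises a new object
        self.fast[key] = value
        self._maybe_spill()
        return value

    def _maybe_spill(self):
        used = sum(self.sizeof(v) for v in self.fast.values())
        for key in list(self.fast):
            if used <= self.target_bytes:
                break
            if key in self._locked:    # skip keys that are still in use
                continue
            value = self.fast.pop(key)
            self.slow[key] = value
            used -= self.sizeof(value)
```

The worker would then wrap the dependency-holding part of task execution and the serialization path of get_data in lock/unlock calls, so the buffer never evicts exactly those keys that are guaranteed to stay in memory anyway.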
Also, if my reasoning is flawed somewhere, I’d appreciate feedback: so far I could only prove that we try to spill data currently in use; the duplication itself is still just theory.
Related issues: https://github.com/dask/dask/issues/2456
Top GitHub Comments
A very naive worst-case calculation of how severe this issue is.
We’re targeting `get_data` requests of about 50MB (`Worker.target_message_size`). Let’s take this as a reference size for a key’s size; in realistic workloads I wouldn’t be surprised to see much larger keys. The original report here described a shuffle for which 50MB splits are reasonable (input 2GB / 32 splits ~ 62MB).

The worker limits concurrent incoming `get_data` requests to `Worker.total_in_connections == 10` (default) but allows this to double for same-host connections, i.e. `#key_size * #total_in_connections * #same_host_mult`. That yields 500-1000MB of data duplication for `get_data`, assuming the LRU is missed consistently, which I assume is approximately true for a shuffle.

The duplication on the `Worker.execute` side scales linearly with the number of threads and the number of input dependencies, i.e. `#key_size * #dependencies * #num_threads`. Again using 50MB per key and assuming 32 dependencies for a shuffle, that yields 1600MB per thread.

Let’s have a look at a worst case: all workers on the same machine, max branching of 32 in a shuffle, key size at 50MB and 2 threads per worker (the two is pretty arbitrary, I have to admit). That’s up to 84 duplications (10 * 2 + 32 * 2) of a key’s data, or 4200MB of unmanaged memory caused by our spillage layer. Various data copies and the overhead of the UDF are on top of that.

I guess this worst case is not too far off from a LocalCluster “deployment” on a big VM. However, most of the numbers are mere guesses and this can easily be scaled up or down, so please take this with a healthy pinch of salt. Either way, duplicates also mean duplicate effort in deserialization and disk IO, so there is a case to be made for avoiding them regardless of the memory footprint.
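As a sanity check, the same back-of-the-envelope numbers in a few lines of Python; all inputs are the guesses stated above, not measured values:

```python
# Worst-case duplication estimate using the guessed inputs from the comment above.
key_size_mb = 50                 # ~ Worker.target_message_size as a reference key size
total_in_connections = 10        # Worker.total_in_connections default
same_host_multiplier = 2         # limit doubles for same-host connections
dependencies = 32                # shuffle branching factor
threads_per_worker = 2           # arbitrary choice, as noted above

get_data_copies = total_in_connections * same_host_multiplier   # 20 copies
execute_copies = dependencies * threads_per_worker               # 64 copies
total_copies = get_data_copies + execute_copies                  # 84 copies

print(f"get_data duplication: {get_data_copies * key_size_mb} MB")   # 1000 MB
print(f"execute duplication:  {execute_copies * key_size_mb} MB")    # 3200 MB
print(f"worst case total:     {total_copies * key_size_mb} MB")      # 4200 MB
```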
From my understanding the zero-copy fix doesn’t help, since the duplication I am referring to here is not caused by memcopies but rather by having distinct Python objects. I’ll try to explain with a simplified pseudo worker/zict buffer.
Let’s assume, for simplicity, that the data we’re concerned about is large and will be spilled by the buffer immediately because it is beyond the threshold `memory_target = memory_limit * memory_target_fraction`. In this case, whenever the data is accessed, the buffer will read it from storage and create a new data instance without storing a reference to it in its data. This instance is returned to the caller, and every subsequent call to the buffer will do the exact same thing: load data, create a new instance, return it without holding a reference.

In this extreme situation this is hopefully easy to follow. In more realistic scenarios we’re spilling data concurrently with other operations. For instance, data is being used as part of an execute, but the buffer spills it. This results in the data no longer being tracked by the buffer (in our new terminology this means the data is “unmanaged memory”), but it is still in memory since it is being used by the task execution. If another worker then requests this piece of data we just spilled, the buffer will load it from the store, create a new instance and start serialization. Simultaneously, X other workers could do the same, resulting in X (get_data) + 1 (executing) copies of the same data, because our buffer doesn’t know about the keys still being in use.
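A runnable toy version of this scenario; the `ToyBuffer` class below is a deliberately simplified stand-in and does not correspond to the real Worker/zict code:

```python
import os
import pickle
import tempfile


class ToyBuffer:
    """Fast dict + pickle-on-disk slow store; no LRU, spilling is triggered explicitly."""

    def __init__(self, directory):
        self.fast = {}
        self.directory = directory

    def __setitem__(self, key, value):
        self.fast[key] = value

    def spill(self, key):
        value = self.fast.pop(key)                      # buffer drops its reference
        with open(os.path.join(self.directory, key), "wb") as f:
            pickle.dump(value, f)

    def __getitem__(self, key):
        if key in self.fast:
            return self.fast[key]
        with open(os.path.join(self.directory, key), "rb") as f:
            return pickle.load(f)                       # a brand new object on every call


with tempfile.TemporaryDirectory() as tmp:
    buffer = ToyBuffer(tmp)
    buffer["x"] = bytearray(2**20)      # pretend this is a large dependency

    data_in_use = buffer["x"]           # held by a running task (execute)
    buffer.spill("x")                   # memory pressure: the buffer spills "x" anyway

    copy_1 = buffer["x"]                # get_data for worker A -> new object from disk
    copy_2 = buffer["x"]                # get_data for worker B -> yet another object

    # Three distinct objects holding the same data are now alive simultaneously,
    # only one of which is tracked ("managed") by the buffer's slow store.
    assert data_in_use is not copy_1 and copy_1 is not copy_2
```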
While typing this, I’m wondering whether we couldn’t fix this by being a bit smart with weakrefs, but I currently don’t have time to test this idea; a rough sketch of what I mean is below.
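An untested sketch of the weakref idea, assuming the buffer keeps weak references to the values it hands out and reuses a still-alive object instead of deserialising a fresh copy. Note that many built-in payload types (bytes, list, dict, ...) cannot be weak-referenced, which is one reason this would need prototyping rather than being a drop-in fix:

```python
import weakref


class WeakrefAwareBuffer:
    """Hypothetical spill buffer that remembers handed-out values via weak references."""

    def __init__(self, fast, slow):
        self.fast = fast                                  # in-memory mapping
        self.slow = slow                                  # on-disk mapping
        self._handed_out = weakref.WeakValueDictionary()  # key -> still-alive value

    def __getitem__(self, key):
        if key in self.fast:
            value = self.fast[key]
        else:
            # If a task or get_data request still holds the object we previously
            # returned, reuse it instead of deserialising another copy from disk.
            value = self._handed_out.get(key)
            if value is None:
                value = self.slow[key]
        try:
            self._handed_out[key] = value
        except TypeError:
            pass          # value type does not support weak references
        return value
```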
Edit: Just in case I haven’t made myself clear yet: I believe this issue is caused by the `zict.Buffer`, not by our serialization protocol or anything else distributed is doing. This issue might be better off in the zict repo, but I believe the visibility here is more important (and we have “our own” `Buffer` subclass by now anyway: https://github.com/dask/distributed/blob/8c73a18ba5c2ffe61977ea936da8ffeacb815c61/distributed/spill.py#L16)