
Worker memory not being freed when tasks complete


I’m still investigating, but in the meantime I wanted to get this issue started.

I’m noticing that after executing a task graph with large inputs and a small output, my worker memory stays high. In the example below we

  1. Generate data (large byte strings)
  2. filter data (slice)
  3. reduce many tasks (sum)

So the final result returned to the client is small, a single Python int. The only large objects should be the initially generated bytestrings.

The console output below is

  1. per-worker memory usage before the computation (~30 MB)
  2. per-worker memory usage right after the computation (~ 230 MB)
  3. per-worker memory usage 5 seconds after, in case things take some time to settle down. (~ 230 MB)
Memory usage [before]
{'tcp://192.168.7.20:50533': '30.92 MB', 'tcp://192.168.7.20:50534': '30.95 MB'}
running
Memory usage [after]
{'tcp://192.168.7.20:50533': '231.97 MB',
 'tcp://192.168.7.20:50534': '232.63 MB'}
Memory usage [after]
{'tcp://192.168.7.20:50533': '232.05 MB',
 'tcp://192.168.7.20:50534': '232.63 MB'}

In an effort to test whether the scheduler or worker is holding a reference to the data, I submit a bunch of tiny inc tasks to one of the workers. I notice that the memory on that worker does settle down:

Memory usage [final]
{'tcp://192.168.7.20:52114': '232.77 MB',
 'tcp://192.168.7.20:52115': '49.73 MB'}

That’s at least consistent with the worker or scheduler holding a reference to the data, but there could be many other causes. I’m still debugging.

The number of inc tasks, 2731, seems to be significant. With 2730 inc tasks, I don’t see any memory reduction on that worker.
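To double-check that the workers are not still storing managed task results (as opposed to unmanaged memory held by the process), a throwaway helper like the one below can be run inside the with Client(...) block of the script that follows. inspect_worker is not part of the reproducer, just an illustration; client.run injects the Worker instance because the function takes a dask_worker argument.

import gc

def inspect_worker(dask_worker):
    # dask_worker.data is the mapping of task results the worker still stores
    return {
        "stored_keys": len(dask_worker.data),
        "gc_collected": gc.collect(),  # objects reclaimed by a manual GC sweep
    }

pprint.pprint(client.run(inspect_worker))

If stored_keys comes back as 0 while RSS stays near 230 MB, the memory is being held below Dask: either lingering references elsewhere in the process, or freed memory the allocator has not returned to the OS.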

import time
from dask.utils import parse_bytes, format_bytes
import pprint
import string
import toolz
from distributed import Client, wait

N = parse_bytes("100 Mb")
I = 20


def inc(x):
    return x + 1


def f(x, n=N):
    time.sleep(0.05)
    return string.ascii_letters[x % 52].encode() * n


def g(x):
    time.sleep(0.02)
    return x[:5]


def h(*args):
    return sum(x[0] for x in args)


def get_mem(dask_worker):
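    # RSS of the worker process; client.run injects the Worker instance as dask_worker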
    return dask_worker.monitor.proc.memory_info().rss


def main():
    dsk = {}
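    # a-* hold ~100 MB bytestrings, b-* are 5-byte slices, c-0 reduces them to one small int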
    for i in range(I):
        dsk[f'a-{i}'] = (f, i, N)
        dsk[f'b-{i}'] = (g, f'a-{i}')
    dsk['c-0'] = (h,) + tuple(f'b-{i}' for i in range(I))

    with Client(n_workers=2, threads_per_worker=1, memory_limit='500Mb', processes=True) as client:
        print("Memory usage [before]")
        pprint.pprint(toolz.valmap(format_bytes, client.run(get_mem)))

        print("running")
        client.get(dsk, keys=["c-0"])
        time.sleep(2)  # let things settle

        print("Memory usage [after]")
        pprint.pprint(toolz.valmap(format_bytes, client.run(get_mem)))

        time.sleep(5)  # settle some more?
        print("Memory usage [after]")
        pprint.pprint(toolz.valmap(format_bytes, client.run(get_mem)))


        print("clear things?")
        futures = client.map(inc, range(2731), pure=False)
        wait(futures)
        del futures

        print("Memory usage [final]")
        pprint.pprint(toolz.valmap(format_bytes, client.run(get_mem)))


if __name__ == '__main__':
    main()
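
For what it's worth, when the workers report no managed data but RSS stays high, one common culprit is the allocator keeping freed pages around. Below is a sketch of how to nudge it, assuming Linux/glibc (malloc_trim lives in libc.so.6, so this is platform-specific), run from inside the with Client(...) block:

import ctypes
import gc

def trim_memory():
    # force a GC pass, then ask glibc to return free heap pages to the OS
    gc.collect()
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)

client.run(trim_memory)

Whether the reported RSS drops after this at least separates reclaimable-but-unreturned heap from objects that are genuinely still referenced.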

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Reactions: 10
  • Comments: 25 (14 by maintainers)

Top GitHub Comments

5 reactions
gioxc88 commented, Dec 8, 2020

This problem is indeed a big one; it prevents me from using Dask in production, where I have a very long-running task and 200 GB of memory get used up in no time. Is there anything at all you can suggest to investigate the issue or to mitigate it?

I already tried the suggested PR without success. My structure is built from nested multiprocessing across several layers, like:

level A                            A1
level B                  B1                 B2
level C             C1        C2       C3        C4
level D           D1  D2    D3  D4   D5  D6    D7  D8

A1 runs in parallel (B1, B2)

B1 runs in parallel (C1, C2); B2 runs in parallel (C3, C4)

C1 runs in parallel (D1, D2); C2 runs in parallel (D3, D4); C3 runs in parallel (D5, D6); C4 runs in parallel (D7, D8)

Everything works fine, except that once the tasks in the innermost layer D are completed, the memory never gets released; it accumulates until the kernel dies.

With processes nested like this I cannot even restart the client in the inner layers, because that would affect the whole computation. So for me there is really no workaround here.

Any help would be much appreciated.
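
One mitigation I still want to try (not verified in this nested setup, and whether LocalCluster actually forwards these keywords to the workers should be checked against your distributed version) is letting the workers recycle themselves via the worker lifetime options, instead of calling client.restart() from an inner layer. A rough sketch, with placeholder numbers:

from dask.distributed import Client, LocalCluster

# lifetime / lifetime_stagger / lifetime_restart are Worker options (distributed >= 2.6);
# passing them through LocalCluster is assumed here, not verified.
cluster = LocalCluster(
    n_workers=8,
    threads_per_worker=1,
    lifetime="30 minutes",         # retire each worker after roughly 30 minutes
    lifetime_stagger="5 minutes",  # stagger retirements so they do not all restart at once
    lifetime_restart=True,         # start a fresh worker in place of each retired one
)
client = Client(cluster)

The same options exist as dask-worker command-line flags (--lifetime, --lifetime-stagger, --lifetime-restart).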

2 reactions
vladaburian commented, Jan 13, 2021

I’m also experiencing some kind of memory leak, though it might not be related. I’m using Dask distributed only as a job scheduler, not even passing any substantial data: the input is just a filename and there is no return value, and the job itself calls only plain pandas and numpy. This way I’m processing 4000 files (almost equally sized) on a 40-core machine in about 45 minutes.

With Dask distributed the memory usage continuously increases until the work is done. At that point it’s consuming 40 GB and the memory is not freed. I see "distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory?" in the logs, which seems fair enough, since Dask is not handling any of the data itself.

The strange thing is that I don’t experience the memory leak when using multiprocessing.Pool to dispatch those jobs (and I also get better CPU usage). The pool still consumes about 10 GB, which is not freed entirely, but the usage stays almost constant the whole time.

So it seems that Dask is not directly involved, yet it makes the difference somehow.

I’m running Debian buster, Python 3.7, and the latest libraries (dask==2020.12.0, numpy==1.19.5, pandas==1.2.0). (Python 3.8 seems to make no difference.)

There are htop screenshots attached (leak-screenshots.zip).

See the code:

import time

import fsspec
import pandas as pd
from dask.distributed import Client, LocalCluster, wait


def compute_profile_from_file(filepath):
    df = pd.read_csv(filepath, compression="gzip", sep=";", names=("eid", "t", "src", "spd"))
    # I observe the memory leak even if I just read the data.
    # As I add more processing steps, more memory is leaked.
    ...
    df.to_parquet(...)


def main_dask():
    fs = fsspec.filesystem("file")
    filepaths = fs.glob("/data/*.csv.gz")

    client = Client(LocalCluster(n_workers=40, threads_per_worker=1))
    results = set()

    for filepath in filepaths:
        if len(results) == 40:
            _, results = wait(results, return_when='FIRST_COMPLETED')

        job = client.submit(compute_profile_from_file, filepath)
        results.add(job)

    client.gather(results)
    del results
    
    time.sleep(24*3600)


def main_mp():
    fs = fsspec.filesystem("file")
    filepaths = fs.glob("/data/*.csv.gz")

    import multiprocessing as mp
    mp.set_start_method('spawn')

    pool = mp.Pool(40)
    pool.map(compute_profile_from_file, filepaths)

    time.sleep(24*3600)


if __name__ == "__main__":
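    # uncomment one of the two entry points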
    #main_dask()
    #main_mp()
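
Since these tasks return nothing and the workers hold no managed data, one blunt workaround (a sketch only; main_dask_batched and the batch size are illustrative, not something I have profiled on this workload) is to process the files in batches and restart the workers between batches, so whatever memory they accumulated is given back:

def main_dask_batched(batch_size=400):
    fs = fsspec.filesystem("file")
    filepaths = fs.glob("/data/*.csv.gz")

    client = Client(LocalCluster(n_workers=40, threads_per_worker=1))

    for start in range(0, len(filepaths), batch_size):
        batch = filepaths[start:start + batch_size]
        futures = client.map(compute_profile_from_file, batch)
        client.gather(futures)  # block until the batch is done
        del futures
        client.restart()        # recycle the worker processes, releasing their memory

This sidesteps the leak rather than explaining it, but it should keep the memory profile roughly as flat as the multiprocessing.Pool run (at the cost of losing the 40-futures-in-flight throttling from main_dask).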
