Memory Leak? Very big graphs clogging scheduler?

See original GitHub issue

I found something that feels like a memory leak, but I’m not really sure if it’s that or just overhead from using the scheduler with lots of tasks (seems unlikely, but I wanted to ask). I saw there were other posts about possible memory leaks, but my way of triggering it feels a bit different from those described elsewhere, so I wanted to post my example code that reproduces it and ask whether there is anything obvious I am messing up or whether it’s a known issue.

In my attempts to calculate lots of basic statistics on a large DataFrame, I now have a small-ish example (it runs on my local machine) that demonstrates a failure in dask that doesn’t happen in pandas with the same underlying code. (Of specific note: I am setting npartitions=100 to reproduce the approximate number of tasks I get on the larger machines / clusters I am trying to do similar work on.)

This is a smaller reproducible example of a problem I was facing on a 4 TB memory setup, where tables of roughly 150 GB would be reported by dask as only using 500 GB of memory during processing steps, but the machine would lock up and actually be using the full amount of RAM. On the larger machine I was starting the cluster via the dask-scheduler and dask-worker CLI commands rather than through the client = Client() setup; a sketch of that alternative setup is shown below.
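
For reference, here is a minimal sketch (host names and ports are placeholders, not from the issue) of that alternative setup, where the scheduler and workers are started with the CLI tools and the client only connects to them:

# Started separately on the cluster (shell commands shown as comments):
#   $ dask-scheduler                              # listens on port 8786 by default
#   $ dask-worker tcp://scheduler-host:8786       # run on each worker machine
from distributed import Client

# Connect to the externally started scheduler instead of letting Client()
# spin up a LocalCluster.
client = Client('tcp://scheduler-host:8786')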

Code Setup

## SETUP

run_with_dask = True

# Create some dummy data.
import numpy as np
import pandas as pd

data = {'col_{}'.format(i): np.random.uniform(size=500000) for i in range(500)}
pd_df = pd.DataFrame(data)
del data

if run_with_dask:
    import dask.dataframe as dd
    from distributed import Client
    client = Client()
    print(client)
    df = dd.from_pandas(pd_df, npartitions=100)
    df = client.persist(df)
else:
    df = pd_df
    
del pd_df

## Statistic Generation

def generate_filter(n_cols=500):
    return {'col': 'col_{}'.format(np.random.choice(n_cols)),
            'op': np.random.choice(['lt', 'gt']),
            'value': np.random.uniform(0.25, 0.75)}

def get_n_filter(n):
    return [generate_filter() for x in range(n)]

def get_statistic_batch(filters, n, n_cols=500):
    return [{'filters': filters, 
             'output': {'col': 'col_{}'.format(np.random.choice(n_cols)),
                        'op': np.random.choice(['avg', 'sum', 'max', 'min', 'stddev', 'unique_count'])
                       }
            } for x in range(n)]

def random_stats(chunks, n_per_chunk):
    stats = []
    for i in range(chunks):
        filt_n = np.random.randint(10)+1
        stats.extend(get_statistic_batch(get_n_filter(filt_n), n_per_chunk))
    return stats

## Statistic Parsing

def filter_index(df, filter):
    filter_ops = {'lt': lambda x, y: x < y, 'gt': lambda x, y: x > y, 'eq': lambda x, y: x == y}
    return filter_ops[filter['op']](df[filter['col']], filter['value'])

def get_indexer(df, filters):
    if len(filters) == 1:
        return filter_index(df, filters[0])
    return np.logical_and(filter_index(df, filters[0]), get_indexer(df, filters[1:]))

def get_statistic(df, statistic):
    indexer = get_indexer(df, statistic['filters'])
    agg_ops = {'sum': np.sum, 'avg': np.mean, 'max': np.max, 'min': np.min, 'stddev': np.std, 'unique_count': lambda x: x.unique().size}
    return agg_ops[statistic['output']['op']](df[statistic['output']['col']][indexer])

Benchmarking

import time
np.random.seed(137)

st = time.time()
single_stat = get_statistic(df, random_stats(1, 1)[0])
ft = time.time()
print("Single stat graph generation (or calculation with pandas): ", ft-st)
if run_with_dask:
    print(len(single_stat.dask))
    st = time.time()
    final_number = single_stat.compute()
    ft = time.time()
    print("Single stat: compute via dask", ft-st)
    
st = time.time()
stats_5_100 = random_stats(5, 100)
batch_5_100 = [get_statistic(df, x) for x in stats_5_100]
ft = time.time()
print("500 stat generation (or calc with pandas). Lots of repeated filters.", ft-st)
if run_with_dask:
    st = time.time()
    final_results = client.compute(batch_5_100, sync=True)
    ft = time.time()
    print("500 stat generation, compute via dask.", ft-st)
    
st = time.time()
stats_100_5 = random_stats(100, 5)
batch_100_5 = [get_statistic(df, x) for x in stats_100_5]
ft = time.time()
print("500 stat generation (or calc with pandas). Few repeated filters",ft-st)
if run_with_dask:
    st = time.time()
    final_results = client.compute(batch_100_5, sync=True)
    ft = time.time()
    print("500 stat generation, compute via dask.", ft-st)
    
st = time.time()
stats_50_100 = random_stats(50, 100)
batch_50_100 = [get_statistic(df, x) for x in stats_50_100]
ft = time.time()
print("5000 stat generation.", ft-st)
if run_with_dask:
    st = time.time()
    final_results = client.compute(batch_50_100, sync=True)
    ft = time.time()
    print("5000 stat generation, compute via dask", ft-st)

Results

Running with run_with_dask = False, memory use (via htop) stays below 3 GB on my machine.

Single stat graph generation (or calculation with pandas):  0.00588536262512207
500 stat generation (or calc with pandas). Lots of repeated filters. 1.989380121231079
500 stat generation (or calc with pandas). Few repeated filters 1.928800106048584
5000 stat generation. 17.788148641586304

Running with run_with_dask = True, memory use via htop fills up my machine (>15.6 GB) and it starts swapping until everything dies. Also, the diagnostic dashboard shows 5.9 GB in use by the cluster, while htop shows a lot more.

Even stopping before the final batch, after running just the two 500-stat batches, shows a mismatch between the memory reported by dask and the actual memory reported by htop.

Extra note that shows this problem even more starkly: by editing data = {'col_{}'.format(i): np.random.uniform(size=500000) for i in range(500)} to create only 5000 rows of data (size=5000), memory use still balloons to >13 GB in dask, while the scheduler dashboard reports 2 GB.

Output from run_with_dask = True on this small dataframe:

Single stat graph generation (or calculation with pandas):  0.010093927383422852
2418
Single stat: compute via dask 1.0854308605194092
500 stat generation (or calc with pandas). Lots of repeated filters. 3.176414728164673
500 stat generation, compute via dask. 58.63161826133728
500 stat generation (or calc with pandas). Few repeated filters 3.3641154766082764
500 stat generation, compute via dask. 97.8330614566803
5000 stat generation. 31.91444993019104
distributed.utils_perf - WARNING - full garbage collections took 12% CPU time recently (threshold: 10%)
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://127.0.0.1:50122 remote=tcp://127.0.0.1:40591>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://127.0.0.1:50124 remote=tcp://127.0.0.1:40591>
distributed.utils_perf - WARNING - full garbage collections took 11% CPU time recently (threshold: 10%)
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://127.0.0.1:50126 remote=tcp://127.0.0.1:40591>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://127.0.0.1:50128 remote=tcp://127.0.0.1:40591>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://127.0.0.1:50130 remote=tcp://127.0.0.1:40591>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://127.0.0.1:50132 remote=tcp://127.0.0.1:40591>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://127.0.0.1:50134 remote=tcp://127.0.0.1:40591>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://127.0.0.1:50136 remote=tcp://127.0.0.1:40591>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://127.0.0.1:50138 remote=tcp://127.0.0.1:40591>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://127.0.0.1:50142 remote=tcp://127.0.0.1:40591>
distributed.utils_perf - WARNING - full garbage collections took 11% CPU time recently (threshold: 10%)

Environment

Python 3.7.1
dask==1.0.0
distributed==1.25.1
msgpack==0.5.6

My general problem as well: https://stackoverflow.com/questions/53844188/how-do-i-use-dask-to-efficiently-calculate-many-simple-statistics

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
songqiqqq commented, Aug 12, 2019

@TomAugspurger Thanks for your reply. I checked that optimize page and I think it should be quite helpful for reducing graph size. However, my project is written with the client.submit API and the graph is dynamically adjusted (using as_completed to implement an optimization algorithm). I have no idea how to apply the fuse-like APIs in the dask.optimization module in my case. Do you have suggestions about this?
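
For context, here is a minimal sketch (illustrative only; the keys and functions are made up, not from this thread) of what fuse in dask.optimization does: it operates on a plain task-graph dict and collapses linear chains into single tasks, which is why it does not map directly onto a stream of client.submit futures that the scheduler receives one at a time.

from operator import add, mul
import dask
from dask.optimization import fuse

# A hand-built graph with a linear chain a -> b -> c.
dsk = {
    'a': (add, 1, 2),     # a = 3
    'b': (add, 'a', 10),  # b = a + 10
    'c': (mul, 'b', 2),   # c = b * 2
}

# Collapse the chain into one task, keeping 'c' as the output key.
fused_dsk, dependencies = fuse(dsk, keys=['c'])
print(fused_dsk)

# The fused graph still evaluates to the same result.
print(dask.get(fused_dsk, 'c'))  # 26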

However, with some experiments, I found some tricks to reduce the total memory used:

  1. Start the client / child processes before loading large variables in the parent process. This avoids redundant copies in the child processes, which matters especially for large variables (see the sketch after this list).
  2. Keep the function definitions clean. First, avoid directly referencing large variables from the outer scope. Second, moving all the function definitions into a separate, lightweight .py file (isolated from the environment where the large variables are loaded or generated) seems to help memory consumption. (I don’t know the mechanism, but it works for me.)
  3. Monitor the task states via client.cluster.scheduler.tasks and throttle the task stream according to that (also shown in the sketch below).
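
A rough sketch of tricks 1 and 3 (build_large_frame and the MAX_TASKS threshold are hypothetical placeholders, assuming a LocalCluster started via Client()):

import numpy as np
import pandas as pd
from distributed import Client

# Trick 1: start the client (and thus the worker processes) while the parent
# process is still small, before the large data exists.
client = Client()

def build_large_frame():
    # hypothetical stand-in for the real data loading / generation
    return pd.DataFrame({'col_{}'.format(i): np.random.uniform(size=500000)
                         for i in range(500)})

pd_df = build_large_frame()   # only the parent process holds this copy

# Trick 3: inspect the scheduler's task table and only submit more work when
# the number of outstanding tasks is below some threshold.
MAX_TASKS = 5000              # hypothetical threshold
if len(client.cluster.scheduler.tasks) < MAX_TASKS:
    pass                      # submit the next batch of tasks here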

With the steps listed above, memory consumption is largely reduced and the problem is almost solved in my case.

0 reactions
TomAugspurger commented, Aug 12, 2019

Do you have suggestions about this?

No, I’m not really familiar with the details.

