Memory Leak? Very big graphs clogging scheduler?
I found something that feels like a memory leak, but I'm not really sure whether it is one or just overhead from using the scheduler with lots of tasks (that seems unlikely, but I wanted to ask). I saw other posts about possible memory leaks, but my way of triggering it feels different from those described elsewhere, so I wanted to post my example code and ask whether there is anything obvious I am messing up, or whether this is a known issue.
In my attempts to calculate lots of basic statistics on a large DataFrame, I now have a small-ish example (it runs on my local machine) that demonstrates a failure in dask that doesn't happen in pandas on the same underlying code. (Of specific note: I am setting `npartitions=100` to reproduce the approximate number of tasks I get on the larger machines / clusters where I am trying to do similar work.)

This is a smaller reproducible version of a problem I was facing on a 4 TB memory setup, where tables of roughly 150 GB would be reported by dask as using only 500 GB in memory during processing steps, but the machine would lock up and actually be using the full amount of RAM. On the larger machine I was starting the cluster via the `dask-worker` and `dask-scheduler` CLI commands rather than through the `client = Client()` setup.
Code Setup
```python
## SETUP
run_with_dask = True

# Create some dummy data.
import numpy as np
import pandas as pd

data = {'col_{}'.format(i): np.random.uniform(size=500000) for i in range(500)}
pd_df = pd.DataFrame(data)
del data

if run_with_dask:
    import dask.dataframe as dd
    from distributed import Client
    client = Client()
    print(client)
    df = dd.from_pandas(pd_df, npartitions=100)
    df = client.persist(df)
else:
    df = pd_df
del pd_df
```
```python
## Statistic Generation
def generate_filter(n_cols=500):
    return {'col': 'col_{}'.format(np.random.choice(n_cols)),
            'op': np.random.choice(['lt', 'gt']),
            'value': np.random.uniform(0.25, 0.75)}

def get_n_filter(n):
    return [generate_filter() for x in range(n)]

def get_statistic_batch(filters, n, n_cols=500):
    return [{'filters': filters,
             'output': {'col': 'col_{}'.format(np.random.choice(n_cols)),
                        'op': np.random.choice(['avg', 'sum', 'max', 'min',
                                                'stddev', 'unique_count'])}
            } for x in range(n)]

def random_stats(chunks, n_per_chunk):
    stats = []
    for i in range(chunks):
        filt_n = np.random.randint(10) + 1
        stats.extend(get_statistic_batch(get_n_filter(filt_n), n_per_chunk))
    return stats
```
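For reference, each filter produced above is just a small dict; a quick standalone check of its shape (only `numpy` required, with the filter column count reduced for the check):

```python
import numpy as np

# Same helper as above, standalone so it runs on its own.
def generate_filter(n_cols=500):
    return {'col': 'col_{}'.format(np.random.choice(n_cols)),
            'op': np.random.choice(['lt', 'gt']),
            'value': np.random.uniform(0.25, 0.75)}

np.random.seed(0)
f = generate_filter()
print(sorted(f.keys()))  # ['col', 'op', 'value']
```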
```python
## Statistic Parsing
def filter_index(df, filter):
    filter_ops = {'lt': lambda x, y: x < y,
                  'gt': lambda x, y: x > y,
                  'eq': lambda x, y: x == y}
    return filter_ops[filter['op']](df[filter['col']], filter['value'])

def get_indexer(df, filters):
    if len(filters) == 1:
        return filter_index(df, filters[0])
    return np.logical_and(filter_index(df, filters[0]),
                          get_indexer(df, filters[1:]))

def get_statistic(df, statistic):
    indexer = get_indexer(df, statistic['filters'])
    agg_ops = {'sum': np.sum, 'avg': np.mean, 'max': np.max, 'min': np.min,
               'stddev': np.std, 'unique_count': lambda x: x.unique().size}
    return agg_ops[statistic['output']['op']](df[statistic['output']['col']][indexer])
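To show what one of these statistics computes, here is a pandas-only sanity check on a tiny hypothetical frame (the filter/aggregation logic is inlined so the snippet stands alone; no dask involved):

```python
import pandas as pd

# A tiny illustrative frame (not the benchmark data).
df = pd.DataFrame({'col_0': [0.1, 0.4, 0.6, 0.9],
                   'col_1': [1.0, 2.0, 3.0, 4.0]})

# One statistic in the same dict shape as above:
# sum of col_1 over rows where col_0 > 0.5.
stat = {'filters': [{'col': 'col_0', 'op': 'gt', 'value': 0.5}],
        'output': {'col': 'col_1', 'op': 'sum'}}

flt = stat['filters'][0]
mask = df[flt['col']] > flt['value']          # the 'gt' branch of filter_index
result = df[stat['output']['col']][mask].sum()  # the 'sum' branch of agg_ops
print(result)  # rows 0.6 and 0.9 pass the filter, so 3.0 + 4.0 = 7.0
```

With pandas this evaluates eagerly; with a dask DataFrame the same operations instead build a task graph per statistic, which is where the graph-size concern below comes from.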
Benchmarking
```python
import time

np.random.seed(137)

st = time.time()
single_stat = get_statistic(df, random_stats(1, 1)[0])
ft = time.time()
print("Single stat graph generation (or calculation with pandas): ", ft - st)

if run_with_dask:
    print(len(single_stat.dask))
    st = time.time()
    final_number = single_stat.compute()
    ft = time.time()
    print("Single stat: compute via dask", ft - st)

st = time.time()
stats_5_100 = random_stats(5, 100)
batch_5_100 = [get_statistic(df, x) for x in stats_5_100]
ft = time.time()
print("500 stat generation (or calc with pandas). Lots of repeated filters.", ft - st)

if run_with_dask:
    st = time.time()
    final_results = client.compute(batch_5_100, sync=True)
    ft = time.time()
    print("500 stat generation, compute via dask.", ft - st)

st = time.time()
stats_100_5 = random_stats(100, 5)
batch_100_5 = [get_statistic(df, x) for x in stats_100_5]
ft = time.time()
print("500 stat generation (or calc with pandas). Few repeated filters", ft - st)

if run_with_dask:
    st = time.time()
    final_results = client.compute(batch_100_5, sync=True)
    ft = time.time()
    print("500 stat generation, compute via dask.", ft - st)

st = time.time()
stats_50_100 = random_stats(50, 100)
batch_50_100 = [get_statistic(df, x) for x in stats_50_100]
ft = time.time()
print("5000 stat generation.", ft - st)

if run_with_dask:
    st = time.time()
    final_results = client.compute(batch_50_100, sync=True)
    ft = time.time()
    print("5000 stat generation, compute via dask", ft - st)
```
Results
Running with `run_with_dask = False`, memory use (via `htop`) stays below 3 GB on my machine.

```
Single stat graph generation (or calculation with pandas): 0.00588536262512207
500 stat generation (or calc with pandas). Lots of repeated filters. 1.989380121231079
500 stat generation (or calc with pandas). Few repeated filters 1.928800106048584
5000 stat generation. 17.788148641586304
```
Running with `run_with_dask = True`, memory use (via `htop`) fills up my machine (>15.6 GB) and starts swapping until everything dies. Watching the diagnostic dashboard shows 5.9 GB in use by the cluster, but `htop` shows a lot more. Even stopping before the final batch, after running just the two 500-stat generation batches, shows a mismatch between the memory reported by dask and the actual memory shown by `htop`.

An extra note that makes the problem even more visceral: after editing `data = {'col_{}'.format(i): np.random.uniform(size=500000) for i in range(500)}` to create only 5000 rows of data (`size=5000`), memory use still balloons to >13 GB in dask while the scheduler dashboard reports 2 GB.
Output from `run_with_dask = True` on this small dataframe:
```
Single stat graph generation (or calculation with pandas): 0.010093927383422852
2418
Single stat: compute via dask 1.0854308605194092
500 stat generation (or calc with pandas). Lots of repeated filters. 3.176414728164673
500 stat generation, compute via dask. 58.63161826133728
500 stat generation (or calc with pandas). Few repeated filters 3.3641154766082764
500 stat generation, compute via dask. 97.8330614566803
5000 stat generation. 31.91444993019104
distributed.utils_perf - WARNING - full garbage collections took 12% CPU time recently (threshold: 10%)
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://127.0.0.1:50122 remote=tcp://127.0.0.1:40591>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://127.0.0.1:50124 remote=tcp://127.0.0.1:40591>
distributed.utils_perf - WARNING - full garbage collections took 11% CPU time recently (threshold: 10%)
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://127.0.0.1:50126 remote=tcp://127.0.0.1:40591>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://127.0.0.1:50128 remote=tcp://127.0.0.1:40591>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://127.0.0.1:50130 remote=tcp://127.0.0.1:40591>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://127.0.0.1:50132 remote=tcp://127.0.0.1:40591>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://127.0.0.1:50134 remote=tcp://127.0.0.1:40591>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://127.0.0.1:50136 remote=tcp://127.0.0.1:40591>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://127.0.0.1:50138 remote=tcp://127.0.0.1:40591>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://127.0.0.1:50142 remote=tcp://127.0.0.1:40591>
distributed.utils_perf - WARNING - full garbage collections took 11% CPU time recently (threshold: 10%)
```
Environment
Python 3.7.1
dask==1.0.0
distributed==1.25.1
msgpack==0.5.6
My general problem is also described here: https://stackoverflow.com/questions/53844188/how-do-i-use-dask-to-efficiently-calculate-many-simple-statistics
Issue Analytics
- Created 5 years ago
- Comments: 8 (4 by maintainers)
Top GitHub Comments
@TomAugspurger Thanks for your reply. I checked that optimization page and think it should be very helpful for reducing graph size. However, my project is written using the `client.submit` API and the graph is dynamically adjusted (using `as_completed` to implement an optimization algorithm). I have no idea how to apply a fuse-like API from the `dask.optimization` module in my case. Do you have suggestions about this?
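One common workaround when `dask.optimization.fuse` doesn't fit a `client.submit`-based workflow is to batch many small operations into a single task before submitting, so the scheduler sees far fewer tasks. A minimal sketch of the idea (the helper names and chunk size of 100 are arbitrary choices, and plain function calls stand in for `client.submit`):

```python
def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def run_batch(funcs, arg):
    # One "task" that evaluates many small operations at once,
    # so the scheduler tracks one key instead of len(funcs) keys.
    return [f(arg) for f in funcs]

# 500 tiny operations become 5 tasks of 100 operations each.
ops = [lambda x, i=i: x + i for i in range(500)]
batches = chunked(ops, 100)

# With a real cluster this would be something like:
#   futures = [client.submit(run_batch, batch, data) for batch in batches]
#   results = client.gather(futures)
results = [run_batch(batch, 10) for batch in batches]
flat = [r for batch in results for r in batch]
print(len(batches), len(flat))  # 5 batches, 500 results
```

The trade-off is coarser scheduling granularity: `as_completed` then reports whole batches rather than individual operations, so the batch size has to balance graph overhead against responsiveness.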
However, with some experiments, I found some tricks to reduce total memory use:

With the steps listed above, memory consumption is largely reduced and the problem is almost solved in my case.
No, I’m not really familiar with the details.