
FFT on modestly large array results in KilledWorker

See original GitHub issue

System Info

  • OS: GNU/Linux x86_64
  • Cores: 64
  • Memory: 252 GB
  • distributed: 1.21.0
  • dask: 0.17.0

Minimum Example

# Import needed libraries
import numpy as np
import distributed
import dask.array
# Create local cluster
client = distributed.Client()
# Create Numpy array
shape = (500, 500, 1000)
big_array = np.random.rand(*shape)
# Create Dask array with appropriate chunk size
chunks = (int(shape[0]/20), int(shape[1]/20), shape[2])
big_dask_array = dask.array.from_array(big_array, chunks=chunks)
# Compute FFT
fft = dask.array.fft.rfft(big_dask_array, axis=2, n=(2 * shape[2] + 1))
foo = fft.compute()
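
A side note (an editor’s addition, not part of the original report): dask.array.from_array embeds the in-memory NumPy array in the task graph under a single 'array-original-…' key, which is the key named in the KilledWorker error further down. For a synthetic test like this, the ~2 GB source array can be avoided entirely by letting Dask generate the random data itself; a minimal sketch, assuming dask.array.random.random is available in the installed version:

# Same experiment, but with the random data generated lazily by Dask, so the
# full NumPy array never has to be shipped around as a single task
big_dask_array = dask.array.random.random(shape, chunks=chunks)
fft = dask.array.fft.rfft(big_dask_array, axis=2, n=(2 * shape[2] + 1))
foo = fft.compute()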

Explanation and Traceback

When calculating a fast Fourier transform on a modestly large array (that still fits in memory), I get repeated error messages that look like,

tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: tcp://127.0.0.1:32854, threads: 1>>
Traceback (most recent call last):
  File "/storage-home/w/wtb2/anaconda3/envs/synthesizar/lib/python3.6/site-packages/tornado/ioloop.py", line 1026, in _run
    return self.callback()
  File "/storage-home/w/wtb2/anaconda3/envs/synthesizar/lib/python3.6/site-packages/distributed/nanny.py", line 251, in memory_monitor
    self.process.process.terminate()
AttributeError: 'NoneType' object has no attribute 'terminate'

followed by the error (and traceback) seen below. Looking at the Bokeh dashboard for the local cluster, the jobs either hang or continually restart.

I’m a bit confused as to why the worker is being killed. The array sizes I’m using here easily fit into memory. Adjusting the chunk size in either direction does not seem to make a difference. The traceback does not provide much insight. Any idea what is going on here?

Reducing the array shape to (475, 475, 1000), I don’t see this error. The tipping point seems to be around (495, 495) for the first two dimensions.
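
For a rough sense of the sizes involved (an editor’s back-of-envelope, not from the original issue), the input and output of this computation together come to about 6 GB:

# Editor's estimate of the data sizes in the example above
input_bytes = 500 * 500 * 1000 * 8        # float64 input:     ~2.0 GB
rfft_len = (2 * 1000 + 1) // 2 + 1        # rfft length for n=2001 -> 1001
output_bytes = 500 * 500 * rfft_len * 16  # complex128 output: ~4.0 GB
# ~6 GB in total, far below the 252 GB of system memory, which is why the
# KilledWorker is surprising; the per-worker limits discussed in the comments
# below turn out to be what matters.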

---------------------------------------------------------------------------
KilledWorker                              Traceback (most recent call last)
<ipython-input-7-2e30612ecfe9> in <module>()
----> 1 foo = fft_1.compute()

~/anaconda3/envs/synthesizar/lib/python3.6/site-packages/dask/base.py in compute(self, **kwargs)
    141         dask.base.compute
    142         """
--> 143         (result,) = compute(self, traverse=False, **kwargs)
    144         return result
    145 

~/anaconda3/envs/synthesizar/lib/python3.6/site-packages/dask/base.py in compute(*args, **kwargs)
    390     postcomputes = [a.__dask_postcompute__() if is_dask_collection(a)
    391                     else (None, a) for a in args]
--> 392     results = get(dsk, keys, **kwargs)
    393     results_iter = iter(results)
    394     return tuple(a if f is None else f(next(results_iter), *a)

~/anaconda3/envs/synthesizar/lib/python3.6/site-packages/distributed/client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, **kwargs)
   2039                 secede()
   2040             try:
-> 2041                 results = self.gather(packed, asynchronous=asynchronous)
   2042             finally:
   2043                 for f in futures.values():

~/anaconda3/envs/synthesizar/lib/python3.6/site-packages/distributed/client.py in gather(self, futures, errors, maxsize, direct, asynchronous)
   1476             return self.sync(self._gather, futures, errors=errors,
   1477                              direct=direct, local_worker=local_worker,
-> 1478                              asynchronous=asynchronous)
   1479 
   1480     @gen.coroutine

~/anaconda3/envs/synthesizar/lib/python3.6/site-packages/distributed/client.py in sync(self, func, *args, **kwargs)
    601             return future
    602         else:
--> 603             return sync(self.loop, func, *args, **kwargs)
    604 
    605     def __repr__(self):

~/anaconda3/envs/synthesizar/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, *args, **kwargs)
    251             e.wait(10)
    252     if error[0]:
--> 253         six.reraise(*error[0])
    254     else:
    255         return result[0]

~/anaconda3/envs/synthesizar/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None

~/anaconda3/envs/synthesizar/lib/python3.6/site-packages/distributed/utils.py in f()
    235             yield gen.moment
    236             thread_state.asynchronous = True
--> 237             result[0] = yield make_coro()
    238         except Exception as exc:
    239             logger.exception(exc)

~/anaconda3/envs/synthesizar/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1053 
   1054                     try:
-> 1055                         value = future.result()
   1056                     except Exception:
   1057                         self.had_exception = True

~/anaconda3/envs/synthesizar/lib/python3.6/site-packages/tornado/concurrent.py in result(self, timeout)
    236         if self._exc_info is not None:
    237             try:
--> 238                 raise_exc_info(self._exc_info)
    239             finally:
    240                 self = None

~/anaconda3/envs/synthesizar/lib/python3.6/site-packages/tornado/util.py in raise_exc_info(exc_info)

~/anaconda3/envs/synthesizar/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1061                     if exc_info is not None:
   1062                         try:
-> 1063                             yielded = self.gen.throw(*exc_info)
   1064                         finally:
   1065                             # Break up a reference to itself

~/anaconda3/envs/synthesizar/lib/python3.6/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
   1354                             six.reraise(type(exception),
   1355                                         exception,
-> 1356                                         traceback)
   1357                     if errors == 'skip':
   1358                         bad_keys.add(key)

~/anaconda3/envs/synthesizar/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None

KilledWorker: ('array-original-4f69bbca3af6e278592118459b4e2c5d', 'tcp://127.0.0.1:41577')

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments:12 (12 by maintainers)

Top GitHub Comments

2 reactions
andyljones commented, Feb 23, 2018

Short: You might be running into a per-worker memory limit. Set worker-memory-terminate: False in ~/.dask/config.yml and see if the problem goes away. It might also be worth turning down the default silence_logs level on LocalCluster, so that dummies like me don’t spend an afternoon chasing dead workers before thinking to turn it down themselves.

Long: I think I had a similar issue to this. I had a long-running computation whose arguments/returns were pretty big (~100MB), and when I recently increased the size of these arguments a fair bit (up to ~300MB) I started getting KilledWorker exceptions.

It seemed to depend on problem size (with which both memory and computation time scale) and on the number of workers: it happened with a LocalCluster(6), but not (or not as reliably) with a LocalCluster(2) or LocalCluster(1).

Finally I turned logging on - which I should’ve done much sooner - and found that when the task failed, I got

distributed.nanny - WARNING - Worker exceeded 95% memory budget.  Restarting
distributed.nanny - WARNING - Worker process 7297 was killed by unknown signal
distributed.scheduler - INFO - Worker 'tcp://127.0.0.1:34409' failed from closed comm: in <closed TCP>: Stream is closed
distributed.scheduler - INFO - Remove worker tcp://127.0.0.1:34409
distributed.nanny - WARNING - Restarting worker

The root problem is that the process-termination memory limit is divided evenly between worker processes, so with a lot of processes and big tasks you can easily run into it without ever saturating your system-level limits. The immediate solution is to set the worker memory termination limit to False.

Might also be a good idea to lower the default silence_logs level to 30 so warnings show up by default.
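
For reference, a rough sketch of what those two suggestions look like in code (an editor’s addition, not from the thread). The exact configuration key has moved between versions: in the releases discussed in this thread it lived in ~/.dask/config.yml as worker-memory-terminate, while more recent dask/distributed releases expose it as the distributed.worker.memory.terminate setting used below; silence_logs is a LocalCluster keyword argument.

import logging
import dask
import distributed

# Disable the "restart the worker at ~95% of its memory budget" behaviour
# (newer config spelling; older releases used worker-memory-terminate in
# ~/.dask/config.yml, as described above)
dask.config.set({'distributed.worker.memory.terminate': False})

# Surface nanny/worker warnings (level 30 == logging.WARNING) instead of
# silencing them, so memory-related restarts are visible
cluster = distributed.LocalCluster(silence_logs=logging.WARNING)
client = distributed.Client(cluster)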

1 reaction
wtbarnes commented, Jun 25, 2018

tl;dr: Using memory_limit as suggested by @jakirkham and @andyljones solves this issue.

Additional explanation: With dask 0.18.0 and distributed 1.22.0, using the minimal example I originally posted, I now get the repeated warning

distributed.nanny - WARNING - Worker exceeded 95% memory budget.  Restarting
distributed.nanny - WARNING - Worker process 68829 was killed by unknown signal

followed by the error,

KilledWorker: ('array-original-f874af421ac213e98c4a116a470adba2', 'tcp://127.0.0.1:34788')

which is indeed more helpful than the previous error I was getting and clearly indicates the need for more memory per worker. Starting up a LocalCluster with more memory allotted per worker (and fewer workers) fixes this problem, e.g.

from distributed import LocalCluster, Client

cluster = LocalCluster(n_workers=32, memory_limit='8GB')
client = Client(cluster)

(as opposed to 64 workers with 4GB each).
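
As a sanity check on those numbers (an editor’s estimate, not from the thread): the nanny restarts a worker once it exceeds roughly 95% of its memory_limit, so the two configurations leave very different headroom per worker:

# Approximate per-worker restart thresholds (editor's estimate)
default_threshold = 252e9 / 64 * 0.95  # ~3.7 GB with 64 workers sharing 252 GB
tuned_threshold = 8e9 * 0.95           # ~7.6 GB with memory_limit='8GB'

A single worker that ends up holding the ~2 GB source array plus a share of the ~4 GB of rfft output can plausibly cross the smaller threshold, even though everything fits comfortably in 252 GB of system memory.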


