Nanny error: Worker process was killed by unknown signal

distributed.nanny - WARNING - Worker process 13375 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker process 13377 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker process 13372 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker process 13383 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker process 13373 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker process 13384 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker process 13380 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker

This happens without fail when using read_parquet with fastparquet. Switching to pyarrow helps, but the error still occurs x% of the time (x depends on how n_workers, n_clients, and memory_limit are set on the client, but I would say it is always greater than 25%).

My machine runs Fedora 27. Thanks to help from @mrocklin, I was able to work around the problem by setting multiprocessing-method to spawn.
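
A minimal sketch of one way to apply that workaround, not necessarily the way it was done in the original debugging session. It assumes that dask maps `DASK_`-prefixed environment variables onto configuration keys (double underscore for nesting), so the variable below would target `distributed.worker.multiprocessing-method`. The helper name `use_spawn_for_workers` is hypothetical.

```python
import os

# Assumption: dask reads DASK_-prefixed environment variables at import time and
# maps this one onto the config key distributed.worker.multiprocessing-method.
ENV_KEY = "DASK_DISTRIBUTED__WORKER__MULTIPROCESSING_METHOD"


def use_spawn_for_workers() -> str:
    """Request the 'spawn' start method; call before importing dask or distributed."""
    os.environ[ENV_KEY] = "spawn"
    return os.environ[ENV_KEY]


use_spawn_for_workers()

# Import dask only after the variable is set, for example:
# from dask.distributed import Client
# client = Client()
```

Setting the variable before the import is a deliberate choice here: if the start method is resolved when the library is first imported, changing the configuration afterwards may have no effect.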

(While debugging this with @mrocklin, we were never able to extract more information about the root cause.)
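
When a log only says that a process was killed by a signal, the exit code of the child can sometimes narrow things down. The following is a generic standard-library sketch, assuming POSIX semantics where a child terminated by signal N reports a return code of -N. It is not how distributed's Nanny inspects its workers, and `describe_exit` is a hypothetical helper.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Turn a child's return code into a readable description (POSIX semantics)."""
    if returncode < 0:
        # subprocess reports termination by signal N as -N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = "unknown signal"
        return f"killed by {name} ({-returncode})"
    return f"exited with status {returncode}"


# Demo child that sends SIGKILL to itself, similar to what an out-of-memory
# killer or a supervising process might do.
child = subprocess.run(
    [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
)
print(describe_exit(child.returncode))
```

A SIGKILL result would be consistent with an external kill (for example by a memory monitor), while SIGSEGV would point toward a crash in native code. Neither was confirmed in this thread.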

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 15 (8 by maintainers)

Top GitHub Comments

mckeown12 commented, Feb 7, 2019 (1 reaction)

Another (self contained, though less minimal) example:

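import numpy as np
import pandas as pd
import pyarrow  # not used directly; confirms the pyarrow Parquet engine is installed

import dask.dataframe as dd
from dask.distributed import Client

# Build a 3,000,000 x 20 frame of random floats with string column names.
df = pd.DataFrame(np.random.rand(3_000_000, 20))
df.columns = ['a_{}'.format(i) for i in range(20)]
# Convert one column to integers.
df['a_1'] = (df['a_1'] * 10000).astype(int)
df.to_parquet('./test.p', compression='gzip')

# Start a local cluster, then read the file back and persist it on the workers.
client = Client()
df2 = dd.read_parquet('./test.p')
df2 = client.persist(df2)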

So far so good. Then df2.mean().compute() fails with the following log output and traceback:

distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: tcp://127.0.0.1:53696, threads: 1>>
Traceback (most recent call last):
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 355, in catch_zombie
    yield
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 394, in _get_pidtaskinfo
    ret = cext.proc_pidtaskinfo_oneshot(self.pid)
ProcessLookupError: [Errno 3] No such process

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/tornado/ioloop.py", line 1229, in _run
    return self.callback()
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/distributed/nanny.py", line 264, in memory_monitor
    memory = proc.memory_info().rss
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_common.py", line 340, in wrapper
    return fun(self)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/__init__.py", line 1047, in memory_info
    return self._proc.memory_info()
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 335, in wrapper
    return fun(self, *args, **kwargs)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 463, in memory_info
    rawtuple = self._get_pidtaskinfo()
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_common.py", line 340, in wrapper
    return fun(self)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 394, in _get_pidtaskinfo
    ret = cext.proc_pidtaskinfo_oneshot(self.pid)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 368, in catch_zombie
    raise AccessDenied(proc.pid, proc._name)
psutil._exceptions.AccessDenied: psutil.AccessDenied (pid=40091)
distributed.nanny - WARNING - Worker process 40091 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: tcp://127.0.0.1:53698, threads: 1>>
Traceback (most recent call last):
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 355, in catch_zombie
    yield
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 394, in _get_pidtaskinfo
    ret = cext.proc_pidtaskinfo_oneshot(self.pid)
ProcessLookupError: [Errno 3] No such process

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/tornado/ioloop.py", line 1229, in _run
    return self.callback()
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/distributed/nanny.py", line 264, in memory_monitor
    memory = proc.memory_info().rss
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_common.py", line 340, in wrapper
    return fun(self)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/__init__.py", line 1047, in memory_info
    return self._proc.memory_info()
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 335, in wrapper
    return fun(self, *args, **kwargs)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 463, in memory_info
    rawtuple = self._get_pidtaskinfo()
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_common.py", line 340, in wrapper
    return fun(self)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 394, in _get_pidtaskinfo
    ret = cext.proc_pidtaskinfo_oneshot(self.pid)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 368, in catch_zombie
    raise AccessDenied(proc.pid, proc._name)
psutil._exceptions.AccessDenied: psutil.AccessDenied (pid=40090)
distributed.nanny - WARNING - Worker process 40090 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: tcp://127.0.0.1:53701, threads: 1>>
Traceback (most recent call last):
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 355, in catch_zombie
    yield
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 394, in _get_pidtaskinfo
    ret = cext.proc_pidtaskinfo_oneshot(self.pid)
ProcessLookupError: [Errno 3] No such process

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/tornado/ioloop.py", line 1229, in _run
    return self.callback()
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/distributed/nanny.py", line 264, in memory_monitor
    memory = proc.memory_info().rss
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_common.py", line 340, in wrapper
    return fun(self)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/__init__.py", line 1047, in memory_info
    return self._proc.memory_info()
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 335, in wrapper
    return fun(self, *args, **kwargs)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 463, in memory_info
    rawtuple = self._get_pidtaskinfo()
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_common.py", line 340, in wrapper
    return fun(self)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 394, in _get_pidtaskinfo
    ret = cext.proc_pidtaskinfo_oneshot(self.pid)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 368, in catch_zombie
    raise AccessDenied(proc.pid, proc._name)
psutil._exceptions.AccessDenied: psutil.AccessDenied (pid=40095)
distributed.nanny - WARNING - Worker process 40095 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: tcp://127.0.0.1:53702, threads: 1>>
Traceback (most recent call last):
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 355, in catch_zombie
    yield
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 394, in _get_pidtaskinfo
    ret = cext.proc_pidtaskinfo_oneshot(self.pid)
ProcessLookupError: [Errno 3] No such process

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/tornado/ioloop.py", line 1229, in _run
    return self.callback()
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/distributed/nanny.py", line 264, in memory_monitor
    memory = proc.memory_info().rss
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_common.py", line 340, in wrapper
    return fun(self)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/__init__.py", line 1047, in memory_info
    return self._proc.memory_info()
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 335, in wrapper
    return fun(self, *args, **kwargs)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 463, in memory_info
    rawtuple = self._get_pidtaskinfo()
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_common.py", line 340, in wrapper
    return fun(self)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 394, in _get_pidtaskinfo
    ret = cext.proc_pidtaskinfo_oneshot(self.pid)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 368, in catch_zombie
    raise AccessDenied(proc.pid, proc._name)
psutil._exceptions.AccessDenied: psutil.AccessDenied (pid=40094)
tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: tcp://127.0.0.1:53702, threads: 1>>
Traceback (most recent call last):
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 355, in catch_zombie
    yield
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 394, in _get_pidtaskinfo
    ret = cext.proc_pidtaskinfo_oneshot(self.pid)
ProcessLookupError: [Errno 3] No such process

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/tornado/ioloop.py", line 1229, in _run
    return self.callback()
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/distributed/nanny.py", line 264, in memory_monitor
    memory = proc.memory_info().rss
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_common.py", line 340, in wrapper
    return fun(self)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/__init__.py", line 1047, in memory_info
    return self._proc.memory_info()
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 335, in wrapper
    return fun(self, *args, **kwargs)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 463, in memory_info
    rawtuple = self._get_pidtaskinfo()
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_common.py", line 340, in wrapper
    return fun(self)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 394, in _get_pidtaskinfo
    ret = cext.proc_pidtaskinfo_oneshot(self.pid)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 368, in catch_zombie
    raise AccessDenied(proc.pid, proc._name)
psutil._exceptions.AccessDenied: psutil.AccessDenied (pid=40094)
distributed.nanny - WARNING - Worker process 40094 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
----------------------------------------
KilledWorker                              Traceback (most recent call last)
<ipython-input-8-3f4e05f049ae> in <module>
----> 1 df2.mean().compute()

~/anaconda2/envs/py36/lib/python3.6/site-packages/dask/base.py in compute(self, **kwargs)
    154         dask.base.compute
    155         """
--> 156         (result,) = compute(self, traverse=False, **kwargs)
    157         return result
    158 

~/anaconda2/envs/py36/lib/python3.6/site-packages/dask/base.py in compute(*args, **kwargs)
    393     keys = [x.__dask_keys__() for x in collections]
    394     postcomputes = [x.__dask_postcompute__() for x in collections]
--> 395     results = schedule(dsk, keys, **kwargs)
    396     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
    397 

~/anaconda2/envs/py36/lib/python3.6/site-packages/distributed/client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, **kwargs)
   2228             try:
   2229                 results = self.gather(packed, asynchronous=asynchronous,
-> 2230                                       direct=direct)
   2231             finally:
   2232                 for f in futures.values():

~/anaconda2/envs/py36/lib/python3.6/site-packages/distributed/client.py in gather(self, futures, errors, maxsize, direct, asynchronous)
   1591             return self.sync(self._gather, futures, errors=errors,
   1592                              direct=direct, local_worker=local_worker,
-> 1593                              asynchronous=asynchronous)
   1594 
   1595     @gen.coroutine

~/anaconda2/envs/py36/lib/python3.6/site-packages/distributed/client.py in sync(self, func, *args, **kwargs)
    645             return future
    646         else:
--> 647             return sync(self.loop, func, *args, **kwargs)
    648 
    649     def __repr__(self):

~/anaconda2/envs/py36/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, *args, **kwargs)
    275             e.wait(10)
    276     if error[0]:
--> 277         six.reraise(*error[0])
    278     else:
    279         return result[0]

~/anaconda2/envs/py36/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None

~/anaconda2/envs/py36/lib/python3.6/site-packages/distributed/utils.py in f()
    260             if timeout is not None:
    261                 future = gen.with_timeout(timedelta(seconds=timeout), future)
--> 262             result[0] = yield future
    263         except Exception as exc:
    264             error[0] = sys.exc_info()

~/anaconda2/envs/py36/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1131 
   1132                     try:
-> 1133                         value = future.result()
   1134                     except Exception:
   1135                         self.had_exception = True

~/anaconda2/envs/py36/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1139                     if exc_info is not None:
   1140                         try:
-> 1141                             yielded = self.gen.throw(*exc_info)
   1142                         finally:
   1143                             # Break up a reference to itself

~/anaconda2/envs/py36/lib/python3.6/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
   1467                             six.reraise(type(exception),
   1468                                         exception,
-> 1469                                         traceback)
   1470                     if errors == 'skip':
   1471                         bad_keys.add(key)

~/anaconda2/envs/py36/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None

KilledWorker: ("('dataframe-sum-chunk-dataframe-sum-agg-c3bb33a21a4ded5f08dea3ba88b780b0', 0)", 'tcp://127.0.0.1:53702')

Similar code snippets which execute as expected:

  1. Remove the line df['a_1']=(df['a_1']*10000).astype(int).
  2. Reduce np.random.rand(3_000_000,20) to np.random.rand(2_000_000,20).
  3. Build the Dask DataFrame directly from pandas instead of reading Parquet:

pdf = pd.DataFrame(np.random.rand(10_000_000,20))
df = dd.from_pandas(pdf,chunksize=10000)
df2 = client.persist(df)
df2.mean().compute()
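
Since the log reports "Worker exceeded 95% memory budget", it may help to compare the raw size of the failing and working frames against a per-worker limit. The sketch below is back-of-the-envelope arithmetic only: it assumes 8 bytes per value and ignores index, pandas, and decompression overhead. `HYPOTHETICAL_LIMIT` is an invented figure for illustration, not the limit used in the thread, and nothing here establishes the root cause.

```python
VALUE_BYTES = 8  # assumption: float64 and int64 values each occupy 8 bytes


def dense_frame_bytes(rows: int, cols: int, itemsize: int = VALUE_BYTES) -> int:
    """Raw size of a dense numeric table, ignoring index and pandas overhead."""
    return rows * cols * itemsize


def exceeds_budget(nbytes: int, memory_limit: int, fraction: float = 0.95) -> bool:
    """True if nbytes is above the given fraction of the memory limit."""
    return nbytes > fraction * memory_limit


# Hypothetical per-worker limit, chosen only to make the comparison concrete.
HYPOTHETICAL_LIMIT = 400 * 10**6

for rows in (2_000_000, 3_000_000):
    size = dense_frame_bytes(rows, 20)
    print(rows, size, exceeds_budget(size, HYPOTHETICAL_LIMIT))
```

Under these assumptions the 3,000,000-row frame holds 480,000,000 bytes of raw values and the 2,000,000-row frame holds 320,000,000. If the whole file ends up as a single partition on one worker, that difference could decide whether the worker crosses its threshold.
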

mrocklin commented, Oct 1, 2018 (1 reaction)

This issue would benefit from a minimal reproducible example.
