
Debugging segfaults on workers

See original GitHub issue

I am having awful problems with segfaulting workers and don't really know how to go about debugging them. Typical tracebacks are below; in the client output this shows up as connection errors and workers killed by unknown signals.

As far as I can tell, no particular part of my code causes the segfaults, as they show up almost randomly in different runs with the same parameters.

This is currently a showstopper for my use of dask for large array-based jobs, as I can't do performance or memory-use optimisation while my workers are constantly dying.
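One way to make crashes like these less opaque is to raise the core-dump limit inside every worker process, so a segfault leaves a core file that can be opened with gdb. A minimal sketch, assuming a connected dask.distributed `Client` bound to `client` (the function name `enable_core_dumps` is my own):

```python
import resource

def enable_core_dumps():
    """Raise the core-dump size limit to the hard limit so a segfaulting
    process leaves a core file that can be inspected with gdb."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)

# Run in every worker process (assumes an existing dask.distributed Client):
# client.run(enable_core_dumps)
```

Where the cores land depends on the kernel's `core_pattern` setting, so it is worth checking `/proc/sys/kernel/core_pattern` on the worker hosts as well.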

distributed.nanny - WARNING - Worker process 28865 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:44041
Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/distributed/comm/tcp.py", line 177, in read
    n_frames = yield stream.read_bytes(8)
  File "/usr/lib64/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
tornado.iostream.StreamClosedError: Stream is closed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/distributed/worker.py", line 1808, in gather_dep
    who=self.address)
  File "/usr/lib64/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/usr/lib64/python3.6/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib64/python3.6/site-packages/distributed/worker.py", line 2808, in get_data_from_worker
    max_connections=max_connections)
  File "/usr/lib64/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/usr/lib64/python3.6/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib64/python3.6/site-packages/distributed/core.py", line 431, in send_recv
    response = yield comm.read(deserializers=deserializers)
  File "/usr/lib64/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/usr/lib64/python3.6/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib64/python3.6/site-packages/distributed/comm/tcp.py", line 198, in read
    convert_stream_closed_error(self, e)
  File "/usr/lib64/python3.6/site-packages/distributed/comm/tcp.py", line 124, in convert_stream_closed_error
    raise CommClosedError("in %s: %s: %s" % (obj, exc.__class__.__name__, exc))
distributed.comm.core.CommClosedError: in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:44041
Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/distributed/comm/tcp.py", line 177, in read
    n_frames = yield stream.read_bytes(8)
  File "/usr/lib64/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
tornado.iostream.StreamClosedError: Stream is closed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/distributed/worker.py", line 1808, in gather_dep
    who=self.address)
  File "/usr/lib64/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/usr/lib64/python3.6/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib64/python3.6/site-packages/distributed/worker.py", line 2808, in get_data_from_worker
    max_connections=max_connections)
  File "/usr/lib64/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/usr/lib64/python3.6/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib64/python3.6/site-packages/distributed/core.py", line 431, in send_recv
    response = yield comm.read(deserializers=deserializers)
  File "/usr/lib64/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/usr/lib64/python3.6/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib64/python3.6/site-packages/distributed/comm/tcp.py", line 198, in read
    convert_stream_closed_error(self, e)
  File "/usr/lib64/python3.6/site-packages/distributed/comm/tcp.py", line 124, in convert_stream_closed_error
    raise CommClosedError("in %s: %s: %s" % (obj, exc.__class__.__name__, exc))
distributed.comm.core.CommClosedError: in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer
distributed.nanny - WARNING - Worker process 29375 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker process 28867 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker process 30141 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:44041
Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/distributed/comm/core.py", line 186, in connect
    quiet_exceptions=EnvironmentError)
  File "/usr/lib64/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
tornado.util.TimeoutError: Timeout

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/distributed/worker.py", line 1808, in gather_dep
    who=self.address)
  File "/usr/lib64/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/usr/lib64/python3.6/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib64/python3.6/site-packages/distributed/worker.py", line 2802, in get_data_from_worker
    comm = yield rpc.connect(worker)
  File "/usr/lib64/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/usr/lib64/python3.6/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib64/python3.6/site-packages/distributed/core.py", line 725, in connect
    connection_args=self.connection_args)
  File "/usr/lib64/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/usr/lib64/python3.6/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib64/python3.6/site-packages/distributed/comm/core.py", line 195, in connect
    _raise(error)
  File "/usr/lib64/python3.6/site-packages/distributed/comm/core.py", line 178, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://127.0.0.1:44041' after 3 s: in <distributed.comm.tcp.TCPConnector object at 0x7f9692278ac8>: ConnectionRefusedError: [Errno 111] Connection refused
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:44041
Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/distributed/comm/core.py", line 186, in connect
    quiet_exceptions=EnvironmentError)
  File "/usr/lib64/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
tornado.util.TimeoutError: Timeout

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/distributed/worker.py", line 1808, in gather_dep
    who=self.address)
  File "/usr/lib64/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/usr/lib64/python3.6/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib64/python3.6/site-packages/distributed/worker.py", line 2802, in get_data_from_worker
    comm = yield rpc.connect(worker)
  File "/usr/lib64/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/usr/lib64/python3.6/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib64/python3.6/site-packages/distributed/core.py", line 725, in connect
    connection_args=self.connection_args)
  File "/usr/lib64/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/usr/lib64/python3.6/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib64/python3.6/site-packages/distributed/comm/core.py", line 195, in connect
    _raise(error)
  File "/usr/lib64/python3.6/site-packages/distributed/comm/core.py", line 178, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://127.0.0.1:44041' after 3 s: in <distributed.comm.tcp.TCPConnector object at 0x7f969693c860>: ConnectionRefusedError: [Errno 111] Connection refused
Normalising by 0.982326090335846
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:44041
Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/distributed/comm/core.py", line 186, in connect
    quiet_exceptions=EnvironmentError)
  File "/usr/lib64/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
tornado.util.TimeoutError: Timeout

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/distributed/worker.py", line 1808, in gather_dep
    who=self.address)
  File "/usr/lib64/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/usr/lib64/python3.6/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib64/python3.6/site-packages/distributed/worker.py", line 2802, in get_data_from_worker
    comm = yield rpc.connect(worker)
  File "/usr/lib64/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/usr/lib64/python3.6/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib64/python3.6/site-packages/distributed/core.py", line 725, in connect
    connection_args=self.connection_args)
  File "/usr/lib64/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/usr/lib64/python3.6/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib64/python3.6/site-packages/distributed/comm/core.py", line 195, in connect
    _raise(error)
  File "/usr/lib64/python3.6/site-packages/distributed/comm/core.py", line 178, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://127.0.0.1:44041' after 3 s: in <distributed.comm.tcp.TCPConnector object at 0x7f96c04ce748>: ConnectionRefusedError: [Errno 111] Connection refused
distributed.nanny - WARNING - Worker process 30462 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
[10091.246585] python3.6[24625]: segfault at 11 ip 00007efe055d3c1a sp 00007ffc8450a070 error 4 in libffi.so.6.0.4[7efe055ce000+7000]
[10091.336375] traps: python3.6[24623] general protection ip:7efe055d3c1a sp:7ffc8450a070 error:0 in libffi.so.6.0.4[7efe055ce000+7000]
[10099.052996] python3.6[24634]: segfault at a ip 00007efe055d3c1a sp 00007ffc8450a070 error 4 in libffi.so.6.0.4[7efe055ce000+7000]
[10100.278125] traps: python3.6[26421] general protection ip:7efe055d3c1a sp:7ffc8450a070 error:0 in libffi.so.6.0.4[7efe055ce000+7000]
[10117.851446] CPU6: Core temperature above threshold, cpu clock throttled (total events = 8009)
[10117.851447] CPU2: Core temperature above threshold, cpu clock throttled (total events = 8009)
[10117.851448] CPU0: Package temperature above threshold, cpu clock throttled (total events = 8205)
[10117.851449] CPU3: Package temperature above threshold, cpu clock throttled (total events = 8205)
[10117.851450] CPU4: Package temperature above threshold, cpu clock throttled (total events = 8205)
[10117.851451] CPU7: Package temperature above threshold, cpu clock throttled (total events = 8205)
[10117.851452] CPU1: Package temperature above threshold, cpu clock throttled (total events = 8205)
[10117.851453] CPU5: Package temperature above threshold, cpu clock throttled (total events = 8205)
[10117.851454] CPU2: Package temperature above threshold, cpu clock throttled (total events = 8205)
[10117.851455] CPU6: Package temperature above threshold, cpu clock throttled (total events = 8205)
[10117.852348] CPU2: Core temperature/speed normal
[10117.852348] CPU6: Core temperature/speed normal
[10117.852349] CPU5: Package temperature/speed normal
[10117.852350] CPU0: Package temperature/speed normal
[10117.852350] CPU1: Package temperature/speed normal
[10117.852351] CPU4: Package temperature/speed normal
[10117.852351] CPU3: Package temperature/speed normal
[10117.852352] CPU7: Package temperature/speed normal
[10117.852353] CPU6: Package temperature/speed normal
[10117.852354] CPU2: Package temperature/speed normal
[10126.771961] traps: python3.6[27536] general protection ip:7efe055d3c1a sp:7ffc8450a070 error:0 in libffi.so.6.0.4[7efe055ce000+7000]
[10137.567077] python3.6[28620]: segfault at 7efddc13cf4c ip 00007efe055d3c02 sp 00007ffc8450a070 error 4 in libffi.so.6.0.4[7efe055ce000+7000]
[10137.575706] python3.6[28906]: segfault at a2e737f ip 00007efe055d3c1a sp 00007ffc8450a070 error 4 in libffi.so.6.0.4[7efe055ce000+7000]
[10137.577545] traps: python3.6[28646] general protection ip:7efe055d3c1a sp:7ffc8450a070 error:0 in libffi.so.6.0.4[7efe055ce000+7000]
[10137.780183] traps: python3.6[29251] general protection ip:7efe0a9ae2f5 sp:7ffc84509fb8 error:0 in libpython3.6m.so.1.0[7efe0a942000+242000]
[10137.839648] python3.6[28623]: segfault at 5f5f7c ip 00007efe055d3c1a sp 00007ffc8450a070 error 4 in libffi.so.6.0.4[7efe055ce000+7000]
[10139.729006] python3.6[29862]: segfault at a ip 00007efe055d3c1a sp 00007ffc8450a070 error 4 in libffi.so.6.0.4[7efe055ce000+7000]
[10139.731557] python3.6[29828]: segfault at a ip 00007efe055d3c1a sp 00007ffc8450a070 error 4 in libffi.so.6.0.4[7efe055ce000+7000]
[10139.737007] python3.6[29880]: segfault at a ip 00007efe055d3c1a sp 00007ffc8450a070 error 4 in libffi.so.6.0.4[7efe055ce000+7000]
[10141.038372] python3.6[29982]: segfault at a ip 00007efe055d3c1a sp 00007ffc8450a070 error 4 in libffi.so.6.0.4[7efe055ce000+7000]
[10153.693700] python3.6[31084]: segfault at a ip 00007efe055d3c1a sp 00007ffc8450a070 error 4 in libffi.so.6.0.4[7efe055ce000+7000]
[10153.768418] python3.6[31136]: segfault at a ip 00007efe055d3c1a sp 00007ffc8450a070 error 4 in libffi.so.6.0.4[7efe055ce000+7000]
[10154.875317] python3.6[31169]: segfault at a ip 00007efe055d3c1a sp 00007ffc8450a070 error 4 in libffi.so.6.0.4[7efe055ce000+7000]
[10165.830557] python3.6[31762]: segfault at 7efddc13bf4c ip 00007efe055d3c02 sp 00007ffc8450a070 error 4 in libffi.so.6.0.4[7efe055ce000+7000]
[10166.970633] python3.6[32304]: segfault at a ip 00007efe055d3c1a sp 00007ffc8450a070 error 4 in libffi.so.6.0.4[7efe055ce000+7000]

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

2 reactions
giantamoeba commented, Dec 7, 2018

I had a similar problem. In my case, cloudpickle=0.6.1 was to blame; I solved it by downgrading to 0.6.0.

For the diagnosis I have used faulthandler:

def enable_fault_handler():
    import faulthandler
    faulthandler.enable()
    print('enabled fault handler')

# run it locally
enable_fault_handler()

# run it on all workers
client.run(enable_fault_handler)

# run it on the scheduler (might fail, but no problem)
client.run_on_scheduler(enable_fault_handler)
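faulthandler can also be switched on without touching code, by exporting `PYTHONFAULTHANDLER=1` (or passing `-X faulthandler`) in the environment that launches the workers, which conveniently survives nanny restarts. A quick local check of the machinery, using `dump_traceback` (note faulthandler writes to a file descriptor, so it needs a real file rather than `io.StringIO`):

```python
import faulthandler
import tempfile

# faulthandler writes straight to a file descriptor, hence a real file.
with tempfile.TemporaryFile(mode="w+") as f:
    faulthandler.enable(file=f)          # dump tracebacks on SIGSEGV etc.
    faulthandler.dump_traceback(file=f)  # on-demand dump of all threads
    faulthandler.disable()
    f.seek(0)
    dump = f.read()

# The dump lists each thread with a signal-safe, C-level traceback.
print(dump.splitlines()[0])
```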

0 reactions
goraj commented, Dec 6, 2018

I have a similar problem to @ptooley's: using Client(processes=True) locally, my workers segfault/trap in libffi.so as well. The following snippet reproduces it on an up-to-date CentOS 7 machine. Whether the crash triggers depends on which arguments I pass; importing umap-learn is mandatory to reproduce it, however. @mrocklin, it would be great if you could try to reproduce this issue.

I tested it using the most recent Miniconda and Anaconda installations on two CentOS 7 machines. To reproduce this issue:

./Miniconda3-latest-Linux-x86_64.sh -b -p ~/m3_centos7
source ~/m3_centos7/bin/activate
conda create -n projectname python=3.6
conda activate projectname
conda install dask distributed
pip install umap-learn
python ~/reproduce.py

output can be seen here: https://gist.github.com/goraj/4f38a803e4d9a02ba54e8b61183a81ea

(projectname) -bash-4.2$ conda --version
conda 4.5.11
(projectname) -bash-4.2$ pip list
Package            Version
bokeh              1.0.2
certifi            2018.10.15
Click              7.0
cloudpickle        0.6.1
cytoolz            0.9.0.1
dask               1.0.0
distributed        1.25.0
heapdict           1.0.0
Jinja2             2.10
llvmlite           0.26.0
locket             0.2.0
MarkupSafe         1.1.0
mkl-fft            1.0.6
mkl-random         1.0.1
msgpack            0.5.6
numba              0.41.0
numpy              1.15.4
olefile            0.46
packaging          18.0
pandas             0.23.4
partd              0.3.9
Pillow             5.3.0
pip                18.1
psutil             5.4.8
pyparsing          2.3.0
python-dateutil    2.7.5
pytz               2018.7
PyYAML             3.13
scikit-learn       0.20.1
scipy              1.1.0
setuptools         40.6.2
six                1.11.0
sortedcontainers   2.1.0
tblib              1.3.2
toolz              0.9.0
tornado            5.1.1
umap-learn         0.3.7
wheel              0.32.3
zict               0.1.3
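Version mismatches like the cloudpickle one above are worth checking on every worker, not just locally (distributed also provides `client.get_versions()` for a broader comparison). A generic sketch of a per-package check one could pass to `client.run`; `package_version` is my own helper, and `importlib.metadata` assumes Python 3.8+:

```python
from importlib import metadata

def package_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None

# Locally:
print(package_version("cloudpickle"))
# On every worker (assumes a connected dask.distributed Client):
# client.run(lambda: package_version("cloudpickle"))
```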

Read more comments on GitHub >

Top Results From Across the Web

  • Debugging Segmentation Faults and Pointer Problems
  • How to debug a GCC segmentation fault - GNU Project
  • Debugging Segfaults in PHP - the Tideways Documentation
  • How to debug segmentation fault? - c++ - Stack Overflow
  • 13 hours debugging a segmentation fault in .NET Core on ...
