Dask hangs at a random moment, hard to reproduce
I have quite a big dask graph to compute. It operates on multiple parquet files from s3, processes the data, and saves the result to local disk. The problem is that the process can stop at a random moment. I’m not sure what causes it, so I don’t know whether I should submit the issue here or in the distributed repo.
Setup:
- AWS EC2 machine with 64 cores and 500 GB of RAM
- docker container running `dask-scheduler`
- docker container running `dask-worker [scheduler-address] --nprocs 64 --memory-limit 0.3`
- docker container running python code, connecting to the scheduler
- Python 3.5
- dask 0.15.2 (the same thing happens with newer versions)
- distributed 1.18.3 (the same thing happens with newer versions)
- tornado 4.5.2
- fastparquet 0.1.2
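For context, a simplified sketch of what my pipeline roughly looks like (the scheduler address, bucket, paths, and column names here are placeholders, not my real ones):

```python
import dask.dataframe as dd
from dask.distributed import Client

# Connect to the scheduler running in its own container
client = Client("tcp://scheduler-address:8786")

# Read many parquet files from S3 (placeholder bucket/path)
df = dd.read_parquet("s3://my-bucket/input-data/")

# Per-partition processing, similar to the dropna/astype/assign tasks in the logs below
df = df.dropna()
df["value"] = df["value"].astype("float64")
df = df.assign(doubled=df["value"] * 2)

# Write the result to local disk
df.to_parquet("/data/output/")
```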
When the computation starts, I can see progress bars in the scheduler dashboard. Everything seems to be fine and tasks are being computed, until suddenly everything stops. CPU usage drops to almost zero, the progress bars stop, and the worker’s bokeh page stops responding. When I restart the whole process, it will most probably stop again, but at a totally different moment. I couldn’t find any pattern for when the progress stops.
I would love to include a minimal example, but I have no idea what causes this behaviour. As I said, it may happen at different moments, so I can’t even tell whether it’s the beginning or the end of my pipeline. So I don’t know how to reproduce it.
I tried looking at the logs. When the computation stops, there’s no sign of anything going wrong in either the worker or the scheduler. After almost an hour of doing nothing, the last lines from the worker and scheduler look like this:
dask-scheduler | 2017-11-05T18:16:56.929566066Z distributed.scheduler - INFO - Starting worker compute stream, tcp://172.19.0.4:37075
dask-scheduler | 2017-11-05T18:16:56.932564278Z distributed.scheduler - INFO - Register tcp://172.19.0.4:58982
dask-scheduler | 2017-11-05T18:16:56.933666514Z distributed.scheduler - INFO - Register tcp://172.19.0.4:41864
dask-scheduler | 2017-11-05T18:16:56.935390112Z distributed.scheduler - INFO - Register tcp://172.19.0.4:57315
dask-scheduler | 2017-11-05T18:16:56.937407410Z distributed.scheduler - INFO - Register tcp://172.19.0.4:33051
dask-scheduler | 2017-11-05T18:16:56.939018686Z distributed.scheduler - INFO - Register tcp://172.19.0.4:39833
dask-scheduler | 2017-11-05T18:16:56.940121438Z distributed.scheduler - INFO - Register tcp://172.19.0.4:59492
dask-scheduler | 2017-11-05T18:16:56.940805054Z distributed.scheduler - INFO - Starting worker compute stream, tcp://172.19.0.4:58982
dask-scheduler | 2017-11-05T18:16:56.941496080Z distributed.scheduler - INFO - Starting worker compute stream, tcp://172.19.0.4:41864
dask-scheduler | 2017-11-05T18:16:56.941522107Z distributed.scheduler - INFO - Starting worker compute stream, tcp://172.19.0.4:57315
dask-scheduler | 2017-11-05T18:16:56.942195523Z distributed.scheduler - INFO - Starting worker compute stream, tcp://172.19.0.4:59492
dask-scheduler | 2017-11-05T18:16:56.942201960Z distributed.scheduler - INFO - Starting worker compute stream, tcp://172.19.0.4:33051
dask-scheduler | 2017-11-05T18:16:56.942719202Z distributed.scheduler - INFO - Starting worker compute stream, tcp://172.19.0.4:39833
dask-scheduler | 2017-11-05T18:16:56.951637056Z distributed.scheduler - INFO - Register tcp://172.19.0.4:38661
dask-scheduler | 2017-11-05T18:16:56.952947498Z distributed.scheduler - INFO - Starting worker compute stream, tcp://172.19.0.4:38661
dask-scheduler | 2017-11-05T18:16:56.957188880Z distributed.scheduler - INFO - Register tcp://172.19.0.4:40078
dask-scheduler | 2017-11-05T18:16:56.958534533Z distributed.scheduler - INFO - Starting worker compute stream, tcp://172.19.0.4:40078
dask-scheduler | 2017-11-05T18:16:57.202564703Z distributed.scheduler - INFO - Receive client connection: Client-82484af4-c255-11e7-8082-0242ac130005
dask-scheduler | 2017-11-05T18:19:44.344417691Z distributed.core - WARNING - Event loop was unresponsive for 1.05s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
dask-scheduler | 2017-11-05T18:20:08.078638857Z distributed.core - WARNING - Event loop was unresponsive for 23.08s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
dask-worker_1 | 2017-11-05T18:21:32.362563969Z distributed.worker - DEBUG - future state: ('read-parquet-b2c5abe77c11876195eb6a77738d1492', 19) - RUNNING
dask-worker_1 | 2017-11-05T18:21:32.362579815Z distributed.worker - DEBUG - Execute key: ('concat-7075105cf1c6acffe24d76c23229da72', 4361) worker: tcp://172.19.0.4:53559
dask-worker_1 | 2017-11-05T18:21:32.393695151Z distributed.worker - DEBUG - Deleted 1 keys
dask-worker_1 | 2017-11-05T18:21:32.394078549Z distributed.worker - DEBUG - Deleted 1 keys
dask-worker_1 | 2017-11-05T18:21:32.394140791Z distributed.worker - DEBUG - Send compute response to scheduler: ('dropna-de531fe104bb0718a633ac3ba57e44c9', 4505), {'thread': 140187946239744, 'status': 'OK', 'op': 'task-finished', 'type': <class 'pandas.core.frame.DataFrame'>, 'key': "('dropna-de531fe104bb0718a633ac3ba57e44c9', 4505)", 'stop': 1509906092.38893, 'start': 1509906091.7096455, 'nbytes': 347313008}
dask-worker_1 | 2017-11-05T18:21:32.395095639Z distributed.worker - DEBUG - Execute key: ('astype-fa084f475d61990b4586d370bacd818e', 4324) worker: tcp://172.19.0.4:44932
dask-worker_1 | 2017-11-05T18:21:32.401399080Z distributed.worker - DEBUG - Send compute response to scheduler: ('assign-1d9eb7afe69be0e3579b6a8be9c6b291', 4455), {'thread': 140187945977600, 'status': 'OK', 'op': 'task-finished', 'type': <class 'pandas.core.frame.DataFrame'>, 'key': "('assign-1d9eb7afe69be0e3579b6a8be9c6b291', 4455)", 'stop': 1509906092.3977563, 'start': 1509906092.1396115, 'nbytes': 322818495}
dask-worker_1 | 2017-11-05T18:21:32.402049239Z distributed.worker - DEBUG - Execute key: ('astype-fa084f475d61990b4586d370bacd818e', 4398) worker: tcp://172.19.0.4:59993
dask-worker_1 | 2017-11-05T18:21:32.431883216Z distributed.worker - DEBUG - Send compute response to scheduler: ('read-parquet-b2c5abe77c11876195eb6a77738d1492', 19), {'thread': 140187923695360, 'status': 'OK', 'op': 'task-finished', 'type': <class 'pandas.core.frame.DataFrame'>, 'key': "('read-parquet-b2c5abe77c11876195eb6a77738d1492', 19)", 'stop': 1509906092.4114537, 'start': 1509906079.2736945, 'nbytes': 365677731}
dask-worker_1 | 2017-11-05T18:21:32.433350396Z distributed.worker - DEBUG - Execute key: ('astype-867f8e9773a47c02dc913a3001779e23', 4419) worker: tcp://172.19.0.4:47891
dask-worker_1 | 2017-11-05T18:21:32.433365801Z distributed.worker - DEBUG - Send compute response to scheduler: ('astype-955e0ccfc224ca95aa0211b2986fc390', 4434), {'thread': 140187922646784, 'status': 'OK', 'op': 'task-finished', 'type': <class 'pandas.core.series.Series'>, 'key': "('astype-955e0ccfc224ca95aa0211b2986fc390', 4434)", 'stop': 1509906092.4325304, 'start': 1509906092.3020687, 'nbytes': 4140533}
dask-worker_1 | 2017-11-05T18:21:32.434231696Z distributed.worker - DEBUG - Execute key: ('getitem-936f98cb941dae846251947cdf358006', 4388) worker: tcp://172.19.0.4:38661
dask-worker_1 | 2017-11-05T18:21:32.435753372Z distributed.worker - DEBUG - Send compute response to scheduler: ('getitem-936f98cb941dae846251947cdf358006', 4388), {'thread': 140187922646784, 'status': 'OK', 'op': 'task-finished', 'type': <class 'pandas.core.series.Series'>, 'key': "('getitem-936f98cb941dae846251947cdf358006', 4388)", 'stop': 1509906092.4348502, 'start': 1509906092.4345806, 'nbytes': 56286028}
dask-worker_1 | 2017-11-05T18:21:32.436108651Z distributed.worker - DEBUG - Execute key: ('assign-7e08af70bcc5505618642c0209ce5970', 4339) worker: tcp://172.19.0.4:38661
dask-worker_1 | 2017-11-05T18:21:32.438114434Z distributed.worker - DEBUG - Deleted 1 keys
dask-worker_1 | 2017-11-05T18:21:32.456848909Z distributed.worker - DEBUG - Send compute response to scheduler: ('concat-7075105cf1c6acffe24d76c23229da72', 4273), {'thread': 140187923957504, 'status': 'OK', 'op': 'task-finished', 'type': <class 'pandas.core.frame.DataFrame'>, 'key': "('concat-7075105cf1c6acffe24d76c23229da72', 4273)", 'stop': 1509906092.3513696, 'start': 1509906092.1742594, 'nbytes': 362452000}
dask-worker_1 | 2017-11-05T18:21:32.458012439Z distributed.worker - DEBUG - Execute key: ('getitem-8e7fcdbbeeef95f9da75ec0adbd9adc3', 4477) worker: tcp://172.19.0.4:48040
dask-worker_1 | 2017-11-05T18:21:32.458026177Z distributed.worker - DEBUG - future state: ('read-parquet-248809b98a4a58f4cebbf22d9fc05dfe', 47) - RUNNING
dask-worker_1 | 2017-11-05T18:21:32.461667639Z distributed.worker - DEBUG - Send compute response to scheduler: ('getitem-8e7fcdbbeeef95f9da75ec0adbd9adc3', 4477), {'thread': 140187923957504, 'status': 'OK', 'op': 'task-finished', 'type': <class 'pandas.core.series.Series'>, 'key': "('getitem-8e7fcdbbeeef95f9da75ec0adbd9adc3', 4477)", 'stop': 1509906092.4587786, 'start': 1509906092.458351, 'nbytes': 35910430}
dask-worker_1 | 2017-11-05T18:21:32.462353208Z distributed.worker - DEBUG - Execute key: ('concat-7075105cf1c6acffe24d76c23229da72', 4422) worker: tcp://172.19.0.4:48040
Here you can see the full logs: worker.log, scheduler.log
After I kill the client, the following lines appear: worker_clientKilled.log, scheduler_clientKilled.log
The output from the client looks like this:
distributed.comm.tcp - DEBUG - Setting TCP keepalive: nprobes=10, idle=10, interval=2
distributed.comm.tcp - DEBUG - Setting TCP user timeout: 30000 ms
distributed.comm.tcp - DEBUG - Setting TCP keepalive: nprobes=10, idle=10, interval=2
distributed.comm.tcp - DEBUG - Setting TCP user timeout: 30000 ms
distributed.client - DEBUG - Started scheduling coroutines. Synchronized
distributed.client - DEBUG - Waiting on futures to clear before gather
<here it hangs forever and I have to kill it>
I will be grateful for any help with debugging this issue.
Top GitHub Comments
Conversation seems to have died here so I’m going to close. Happy to reopen if things start up again.
For others who come along: there are some common causes for an unresponsive cluster, and some common things you can do to check what is wrong:
- `client.get_versions(check=True)`
- `client.call_stack()`
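For example, from a separate Python session connected to the same scheduler (the address is a placeholder):

```python
from dask.distributed import Client

# Connect to the (possibly hung) scheduler
client = Client("tcp://scheduler-address:8786")

# Raises if the client, scheduler, and workers have mismatched package versions
client.get_versions(check=True)

# Shows what each worker is currently executing, which helps spot stuck tasks
print(client.call_stack())
```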
This is odd. My first guess would be that you’re trying to move a large amount of data from the workers to the client, such as might be caused by calling `df.compute()` on a very large result. Alternatively, you might want to check the memory and CPU use of the scheduler to make sure that it isn’t becoming overwhelmed. This is available on the System tab of the diagnostic dashboard.
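As a rough sketch of the first point, writing large results out from the workers and gathering only small aggregates avoids pulling everything through the client (the paths and column names below are placeholders):

```python
import dask.dataframe as dd

# Placeholder input path; df stands in for a large dask DataFrame
df = dd.read_parquet("s3://my-bucket/input-data/")

# Pulling the entire result back to the client can overwhelm it:
# big_pandas_df = df.compute()

# Writing from the workers and computing only small summaries is much lighter:
df.to_parquet("/data/output/")
small_result = df["value"].mean().compute()
```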