Shuffle prototype: Feedback (disk usage + workers dying)
I have tried #8223 on a ~3.4TB gzipped Parquet dataset.
I have tried four runs so far, with two different behaviours:
- First I tried the whole dataset. I got to the last step (`to_parquet`), but then ran out of disk space, with over 4TB used by the shuffle.
- In the subsequent three runs I tried a subset (by passing a smaller number of files to `read_parquet`). I did not get past the `set_index` step: workers seem to die and the computation hangs, always towards the end. Errors below.
Based on the first run, this does look like it could’ve actually been successful if I had more disk space. That’s quite exciting, as external sorting has been a big issue for me.
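For a future attempt, one workaround might be to point the workers at a larger scratch volume. This is only a sketch, not part of the runs above: I'm assuming the shuffle service writes its intermediate files under the workers' local directory (which I haven't verified), and /mnt/big-scratch is a made-up path.

from dask.distributed import Client, LocalCluster

# Hypothetical variant of my cluster setup: same workers, but their local
# directory (where spilled/shuffled data should land, if my assumption about
# the shuffle service holds) lives on a volume with more than 4TB free.
cluster = LocalCluster(
    n_workers=10,
    memory_limit='25GB',
    threads_per_worker=2,
    local_directory='/mnt/big-scratch',  # made-up path, substitute a real one
)
client = Client(cluster)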
Code used
Sadly I cannot share the data. I am at least sharing the code I’m using.
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster
from pathlib import Path

# Local cluster: 10 workers x 2 threads, 25GB memory limit each
cluster = LocalCluster(n_workers=10, memory_limit='25GB', threads_per_worker=2)
client = Client(cluster)

# ~3.4TB of gzipped Parquet files, reading only the two columns I need
paths_files = list(Path('.').glob('*/*.parquet'))
columns = ['col_a', 'col_b']
data = dd.read_parquet(paths_files, columns=columns)

# Here is where workers fail and the computation hangs (3 out of 4 tries with a subset)
data = data.set_index('col_a', shuffle="service")

# Here is where it ran out of disk space (1 out of 4 tries)
data.to_parquet('cache-imported.gz.parq', compression='gzip')
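To see how close a run gets before the disk fills up, something like the snippet below could run alongside the computation. This is my own addition, not part of the repro; it only uses the standard library, and the path argument should point at wherever the shuffle files actually end up.

import shutil
import threading
import time

def watch_disk(path='.', interval=60):
    # Print free space on the filesystem holding `path` once per interval
    while True:
        usage = shutil.disk_usage(path)
        print(f"free: {usage.free / 1e12:.2f} TB of {usage.total / 1e12:.2f} TB")
        time.sleep(interval)

# Run in the background while set_index / to_parquet executes
threading.Thread(target=watch_disk, args=('.',), daemon=True).start()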
Errors
Original error
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:38001
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 205, in read
frames_nbytes = await stream.read_bytes(fmt_size)
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 492, in wait_for
fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 319, in connect
handshake = await asyncio.wait_for(comm.read(), time_left())
File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 494, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2654, in gather_dep
response = await get_data_from_worker(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3982, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
return await retry(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
return await coro()
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3959, in _get_data
comm = await rpc.connect(worker)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1048, in connect
raise exc
File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1032, in connect
comm = await fut
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 324, in connect
raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:38001 after 30 s
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:38215
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 245, in write
async def write(self, msg, serializers=None, on_error="message"):
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 452, in wait_for
fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 320, in connect
await asyncio.wait_for(comm.write(local_info), time_left())
File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 454, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2654, in gather_dep
response = await get_data_from_worker(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3982, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
return await retry(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
return await coro()
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3959, in _get_data
comm = await rpc.connect(worker)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1048, in connect
raise exc
File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1032, in connect
comm = await fut
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 324, in connect
raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:38215 after 30 s
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:39419
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 245, in write
async def write(self, msg, serializers=None, on_error="message"):
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 452, in wait_for
fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 320, in connect
await asyncio.wait_for(comm.write(local_info), time_left())
File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 454, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2654, in gather_dep
response = await get_data_from_worker(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3982, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
return await retry(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
return await coro()
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3959, in _get_data
comm = await rpc.connect(worker)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1048, in connect
raise exc
File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1032, in connect
comm = await fut
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 324, in connect
raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:39419 after 30 s
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:42233
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 245, in write
async def write(self, msg, serializers=None, on_error="message"):
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 452, in wait_for
fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 320, in connect
await asyncio.wait_for(comm.write(local_info), time_left())
File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 454, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2654, in gather_dep
response = await get_data_from_worker(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3982, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
return await retry(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
return await coro()
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3959, in _get_data
comm = await rpc.connect(worker)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1048, in connect
raise exc
File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1032, in connect
comm = await fut
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 324, in connect
raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:42233 after 30 s
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:33333
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 245, in write
async def write(self, msg, serializers=None, on_error="message"):
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 452, in wait_for
fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 320, in connect
await asyncio.wait_for(comm.write(local_info), time_left())
File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 454, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2654, in gather_dep
response = await get_data_from_worker(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3982, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
return await retry(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
return await coro()
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3959, in _get_data
comm = await rpc.connect(worker)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1048, in connect
raise exc
File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1032, in connect
comm = await fut
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 324, in connect
raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:33333 after 30 s
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:38215
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 245, in write
async def write(self, msg, serializers=None, on_error="message"):
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 452, in wait_for
fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 320, in connect
await asyncio.wait_for(comm.write(local_info), time_left())
File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 454, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2654, in gather_dep
response = await get_data_from_worker(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3982, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
return await retry(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
return await coro()
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3959, in _get_data
comm = await rpc.connect(worker)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1048, in connect
raise exc
File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1032, in connect
comm = await fut
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 324, in connect
raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:38215 after 30 s
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:33333
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 245, in write
async def write(self, msg, serializers=None, on_error="message"):
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 452, in wait_for
fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 320, in connect
await asyncio.wait_for(comm.write(local_info), time_left())
File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 454, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2654, in gather_dep
response = await get_data_from_worker(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3982, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
return await retry(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
return await coro()
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3959, in _get_data
comm = await rpc.connect(worker)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1048, in connect
raise exc
File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1032, in connect
comm = await fut
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 324, in connect
raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:33333 after 30 s
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:33333
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 245, in write
async def write(self, msg, serializers=None, on_error="message"):
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 452, in wait_for
fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 320, in connect
await asyncio.wait_for(comm.write(local_info), time_left())
File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 454, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2654, in gather_dep
response = await get_data_from_worker(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3982, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
return await retry(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
return await coro()
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3959, in _get_data
comm = await rpc.connect(worker)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1048, in connect
raise exc
File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1032, in connect
comm = await fut
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 324, in connect
raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:33333 after 30 s
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:36893
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 205, in read
frames_nbytes = await stream.read_bytes(fmt_size)
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 492, in wait_for
fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 319, in connect
handshake = await asyncio.wait_for(comm.read(), time_left())
File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 494, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2654, in gather_dep
response = await get_data_from_worker(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3982, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
return await retry(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
return await coro()
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3959, in _get_data
comm = await rpc.connect(worker)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1048, in connect
raise exc
File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1032, in connect
comm = await fut
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 324, in connect
raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:36893 after 30 s
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:41795
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 205, in read
frames_nbytes = await stream.read_bytes(fmt_size)
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 492, in wait_for
fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 319, in connect
handshake = await asyncio.wait_for(comm.read(), time_left())
File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 494, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2654, in gather_dep
response = await get_data_from_worker(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3982, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
return await retry(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
return await coro()
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3959, in _get_data
comm = await rpc.connect(worker)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1048, in connect
raise exc
File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1032, in connect
comm = await fut
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 324, in connect
raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:41795 after 30 s
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:41795
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 245, in write
async def write(self, msg, serializers=None, on_error="message"):
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 452, in wait_for
fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 320, in connect
await asyncio.wait_for(comm.write(local_info), time_left())
File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 454, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2654, in gather_dep
response = await get_data_from_worker(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3982, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
return await retry(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
return await coro()
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3959, in _get_data
comm = await rpc.connect(worker)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1048, in connect
raise exc
File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1032, in connect
comm = await fut
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 324, in connect
raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:41795 after 30 s
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:41795
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 245, in write
async def write(self, msg, serializers=None, on_error="message"):
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 452, in wait_for
fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 320, in connect
await asyncio.wait_for(comm.write(local_info), time_left())
File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 454, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2654, in gather_dep
response = await get_data_from_worker(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3982, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
return await retry(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
return await coro()
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3959, in _get_data
comm = await rpc.connect(worker)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1048, in connect
raise exc
File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1032, in connect
comm = await fut
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 324, in connect
raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:41795 after 30 s
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:38707
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 205, in read
frames_nbytes = await stream.read_bytes(fmt_size)
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 492, in wait_for
fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 319, in connect
handshake = await asyncio.wait_for(comm.read(), time_left())
File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 494, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2654, in gather_dep
response = await get_data_from_worker(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3982, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
return await retry(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
return await coro()
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3959, in _get_data
comm = await rpc.connect(worker)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1048, in connect
raise exc
File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1032, in connect
comm = await fut
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 324, in connect
raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:38707 after 30 s
distributed.utils - ERROR - ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", [("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'register-replica', 'released', 'compute-task-1635104458.1960478', 1635104468.872155), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'released', 'fetch', {}, 'compute-task-1635104458.1960478', 1635104468.8722844), ('gather-dependencies', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'stimulus', 1635104468.872853), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'fetch', 'flight', {}, 'ensure-communicating-1635104468.8727045', 1635104468.8728676), ('request-dep', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104468.8727045', 1635104470.384826), ('receive-dep-failed', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104468.8727045', 1635104503.914738), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'missing-dep'), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'flight', 'fetch', {}, 'ensure-communicating-1635104468.8727045', 1635104503.9147604), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'fetch', 'missing', {}, 'ensure-communicating-1635104503.91477', 1635104503.9147818), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'missing', 'fetch', {}, 'find-missing-1635104504.5391095', 1635104504.5398495), ('gather-dependencies', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'stimulus', 1635104504.5398893), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'fetch', 'flight', {}, 'ensure-communicating-1635104504.5398538', 1635104504.5398993), ('request-dep', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104504.5398538', 1635104504.5399654), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'flight', 'cancelled', {}, 'processing-released-1635104504.5400004', 1635104504.5406826), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'cancelled', 'cancelled', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)": 'released'}, 'processing-released-1635104504.5400004', 1635104504.5406876), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'flight', 'cancelled', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)": 'released'}, 'processing-released-1635104504.5400004', 1635104504.540689), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'release-key', 'processing-released-1635104504.5400004'), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'cancelled', 'released', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)": 'forgotten'}, 'processing-released-1635104504.5400004', 1635104504.5407002), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'released', 'forgotten', {}, 'processing-released-1635104504.5400004', 1635104504.5407038), ('lost-during-gather', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104504.5398538'), ('receive-dep', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104504.5398538', 1635104548.287821)])
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils.py", line 648, in log_errors
yield
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2745, in gather_dep
assert ts, (d, self.story(d))
AssertionError: ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", [("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'register-replica', 'released', 'compute-task-1635104458.1960478', 1635104468.872155), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'released', 'fetch', {}, 'compute-task-1635104458.1960478', 1635104468.8722844), ('gather-dependencies', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'stimulus', 1635104468.872853), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'fetch', 'flight', {}, 'ensure-communicating-1635104468.8727045', 1635104468.8728676), ('request-dep', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104468.8727045', 1635104470.384826), ('receive-dep-failed', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104468.8727045', 1635104503.914738), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'missing-dep'), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'flight', 'fetch', {}, 'ensure-communicating-1635104468.8727045', 1635104503.9147604), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'fetch', 'missing', {}, 'ensure-communicating-1635104503.91477', 1635104503.9147818), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'missing', 'fetch', {}, 'find-missing-1635104504.5391095', 1635104504.5398495), ('gather-dependencies', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'stimulus', 1635104504.5398893), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'fetch', 'flight', {}, 'ensure-communicating-1635104504.5398538', 1635104504.5398993), ('request-dep', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104504.5398538', 1635104504.5399654), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'flight', 'cancelled', {}, 'processing-released-1635104504.5400004', 1635104504.5406826), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'cancelled', 'cancelled', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)": 'released'}, 'processing-released-1635104504.5400004', 1635104504.5406876), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'flight', 'cancelled', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)": 'released'}, 'processing-released-1635104504.5400004', 1635104504.540689), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'release-key', 'processing-released-1635104504.5400004'), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'cancelled', 'released', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)": 'forgotten'}, 'processing-released-1635104504.5400004', 1635104504.5407002), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'released', 'forgotten', {}, 'processing-released-1635104504.5400004', 1635104504.5407038), ('lost-during-gather', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104504.5398538'), ('receive-dep', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104504.5398538', 1635104548.287821)])
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f5f45925ac0>>, <Task finished name='Task-8548' coro=<Worker.gather_dep() done, defined at /root/miniconda3/lib/python3.9/site-packages/distributed/worker.py:2588> exception=AssertionError(("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", [("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'register-replica', 'released', 'compute-task-1635104458.1960478', 1635104468.872155), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'released', 'fetch', {}, 'compute-task-1635104458.1960478', 1635104468.8722844), ('gather-dependencies', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'stimulus', 1635104468.872853), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'fetch', 'flight', {}, 'ensure-communicating-1635104468.8727045', 1635104468.8728676), ('request-dep', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104468.8727045', 1635104470.384826), ('receive-dep-failed', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104468.8727045', 1635104503.914738), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'missing-dep'), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'flight', 'fetch', {}, 'ensure-communicating-1635104468.8727045', 1635104503.9147604), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'fetch', 'missing', {}, 'ensure-communicating-1635104503.91477', 1635104503.9147818), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'missing', 'fetch', {}, 'find-missing-1635104504.5391095', 1635104504.5398495), ('gather-dependencies', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'stimulus', 1635104504.5398893), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'fetch', 'flight', {}, 'ensure-communicating-1635104504.5398538', 1635104504.5398993), ('request-dep', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104504.5398538', 1635104504.5399654), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'flight', 'cancelled', {}, 'processing-released-1635104504.5400004', 1635104504.5406826), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'cancelled', 'cancelled', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)": 'released'}, 'processing-released-1635104504.5400004', 1635104504.5406876), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'flight', 'cancelled', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)": 'released'}, 'processing-released-1635104504.5400004', 1635104504.540689), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'release-key', 'processing-released-1635104504.5400004'), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'cancelled', 'released', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)": 'forgotten'}, 'processing-released-1635104504.5400004', 1635104504.5407002), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'released', 'forgotten', {}, 'processing-released-1635104504.5400004', 1635104504.5407038), ('lost-during-gather', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104504.5398538'), ('receive-dep', 'tcp://127.0.0.1:41795', 
{"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104504.5398538', 1635104548.287821)]))>)
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/tornado/ioloop.py", line 741, in _run_callback
ret = callback()
File "/root/miniconda3/lib/python3.9/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
future.result()
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2745, in gather_dep
assert ts, (d, self.story(d))
AssertionError: ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", [("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'register-replica', 'released', 'compute-task-1635104458.1960478', 1635104468.872155), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'released', 'fetch', {}, 'compute-task-1635104458.1960478', 1635104468.8722844), ('gather-dependencies', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'stimulus', 1635104468.872853), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'fetch', 'flight', {}, 'ensure-communicating-1635104468.8727045', 1635104468.8728676), ('request-dep', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104468.8727045', 1635104470.384826), ('receive-dep-failed', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104468.8727045', 1635104503.914738), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'missing-dep'), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'flight', 'fetch', {}, 'ensure-communicating-1635104468.8727045', 1635104503.9147604), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'fetch', 'missing', {}, 'ensure-communicating-1635104503.91477', 1635104503.9147818), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'missing', 'fetch', {}, 'find-missing-1635104504.5391095', 1635104504.5398495), ('gather-dependencies', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'stimulus', 1635104504.5398893), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'fetch', 'flight', {}, 'ensure-communicating-1635104504.5398538', 1635104504.5398993), ('request-dep', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104504.5398538', 1635104504.5399654), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'flight', 'cancelled', {}, 'processing-released-1635104504.5400004', 1635104504.5406826), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'cancelled', 'cancelled', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)": 'released'}, 'processing-released-1635104504.5400004', 1635104504.5406876), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'flight', 'cancelled', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)": 'released'}, 'processing-released-1635104504.5400004', 1635104504.540689), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'release-key', 'processing-released-1635104504.5400004'), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'cancelled', 'released', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)": 'forgotten'}, 'processing-released-1635104504.5400004', 1635104504.5407002), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'released', 'forgotten', {}, 'processing-released-1635104504.5400004', 1635104504.5407038), ('lost-during-gather', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104504.5398538'), ('receive-dep', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104504.5398538', 1635104548.287821)])
Another error during `to_parquet`
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:37667
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2741, in gather_dep
response = await get_data_from_worker(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 4031, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
return await retry(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
return await coro()
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 4011, in _get_data
response = await send_recv(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 640, in send_recv
response = await comm.read(deserializers=deserializers)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 221, in read
convert_stream_closed_error(self, e)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 126, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:37530 remote=tcp://127.0.0.1:37667>: ConnectionResetError: [Errno 104] Connection reset by peer
distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:37667 -> tcp://127.0.0.1:34511
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
TimeoutError: [Errno 110] Connection timed out
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 1649, in get_data
response = await comm.read(deserializers=serializers)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 221, in read
convert_stream_closed_error(self, e)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 126, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) local=tcp://127.0.0.1:37667 remote=tcp://127.0.0.1:37530>: TimeoutError: [Errno 110] Connection timed out
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:36281
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2741, in gather_dep
response = await get_data_from_worker(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 4031, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
return await retry(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
return await coro()
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 4011, in _get_data
response = await send_recv(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 640, in send_recv
response = await comm.read(deserializers=deserializers)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 221, in read
convert_stream_closed_error(self, e)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 126, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:45000 remote=tcp://127.0.0.1:36281>: ConnectionResetError: [Errno 104] Connection reset by peer
distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:36281 -> tcp://127.0.0.1:34511
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
TimeoutError: [Errno 110] Connection timed out
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 1649, in get_data
response = await comm.read(deserializers=serializers)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 221, in read
convert_stream_closed_error(self, e)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 126, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) local=tcp://127.0.0.1:36281 remote=tcp://127.0.0.1:45000>: TimeoutError: [Errno 110] Connection timed out
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f348fe7bac0>>, <Task finished name='Task-9116' coro=<Worker.heartbeat() done, defined at /root/miniconda3/lib/python3.9/site-packages/distributed/worker.py:1182> exception=OSError('Timed out during handshake while connecting to tcp://127.0.0.1:43431 after 30 s')>)
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 245, in write
async def write(self, msg, serializers=None, on_error="message"):
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 452, in wait_for
fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 320, in connect
await asyncio.wait_for(comm.write(local_info), time_left())
File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 454, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/tornado/ioloop.py", line 741, in _run_callback
ret = callback()
File "/root/miniconda3/lib/python3.9/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
future.result()
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 1231, in heartbeat
raise e
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 1190, in heartbeat
response = await retry_operation(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
return await retry(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
return await coro()
File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 860, in send_recv_from_rpc
comm = await self.pool.connect(self.addr)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1048, in connect
raise exc
File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1032, in connect
comm = await fut
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 324, in connect
raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:43431 after 30 s
distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:36281 -> tcp://127.0.0.1:37667
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
TimeoutError: [Errno 110] Connection timed out
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 1649, in get_data
response = await comm.read(deserializers=serializers)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 221, in read
convert_stream_closed_error(self, e)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 126, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) local=tcp://127.0.0.1:36281 remote=tcp://127.0.0.1:44898>: TimeoutError: [Errno 110] Connection timed out
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:36281
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2741, in gather_dep
response = await get_data_from_worker(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 4031, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
return await retry(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
return await coro()
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 4011, in _get_data
response = await send_recv(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 640, in send_recv
response = await comm.read(deserializers=deserializers)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 221, in read
convert_stream_closed_error(self, e)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 126, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:44898 remote=tcp://127.0.0.1:36281>: ConnectionResetError: [Errno 104] Connection reset by peer
distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:43515 -> tcp://127.0.0.1:37275
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
TimeoutError: [Errno 110] Connection timed out
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 1649, in get_data
response = await comm.read(deserializers=serializers)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 221, in read
convert_stream_closed_error(self, e)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 126, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) local=tcp://127.0.0.1:43515 remote=tcp://127.0.0.1:45898>: TimeoutError: [Errno 110] Connection timed out
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:43515
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2741, in gather_dep
response = await get_data_from_worker(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 4031, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
return await retry(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
return await coro()
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 4011, in _get_data
response = await send_recv(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 640, in send_recv
response = await comm.read(deserializers=deserializers)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 221, in read
convert_stream_closed_error(self, e)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 126, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:45898 remote=tcp://127.0.0.1:43515>: ConnectionResetError: [Errno 104] Connection reset by peer
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:37275
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2741, in gather_dep
response = await get_data_from_worker(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 4031, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
return await retry(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
return await coro()
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 4011, in _get_data
response = await send_recv(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 640, in send_recv
response = await comm.read(deserializers=deserializers)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 221, in read
convert_stream_closed_error(self, e)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 126, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:40650 remote=tcp://127.0.0.1:37275>: ConnectionResetError: [Errno 104] Connection reset by peer
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:43515
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2741, in gather_dep
response = await get_data_from_worker(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 4031, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
return await retry(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
return await coro()
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 4011, in _get_data
response = await send_recv(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 640, in send_recv
response = await comm.read(deserializers=deserializers)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 221, in read
convert_stream_closed_error(self, e)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 126, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:46346 remote=tcp://127.0.0.1:43515>: ConnectionResetError: [Errno 104] Connection reset by peer
distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:43515 -> tcp://127.0.0.1:37667
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
TimeoutError: [Errno 110] Connection timed out
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 1649, in get_data
response = await comm.read(deserializers=serializers)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 221, in read
convert_stream_closed_error(self, e)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 126, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) local=tcp://127.0.0.1:43515 remote=tcp://127.0.0.1:46346>: TimeoutError: [Errno 110] Connection timed out
distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:37275 -> tcp://127.0.0.1:34511
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
TimeoutError: [Errno 110] Connection timed out
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 1649, in get_data
response = await comm.read(deserializers=serializers)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 221, in read
convert_stream_closed_error(self, e)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 126, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) local=tcp://127.0.0.1:37275 remote=tcp://127.0.0.1:40650>: TimeoutError: [Errno 110] Connection timed out
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 29.25 GiB -- Worker memory limit: 46.57 GiB
distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:37667 -> tcp://127.0.0.1:36281
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
TimeoutError: [Errno 110] Connection timed out
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 1649, in get_data
response = await comm.read(deserializers=serializers)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 221, in read
convert_stream_closed_error(self, e)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 126, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) local=tcp://127.0.0.1:37667 remote=tcp://127.0.0.1:37070>: TimeoutError: [Errno 110] Connection timed out
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:37667
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2741, in gather_dep
response = await get_data_from_worker(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 4031, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
return await retry(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
return await coro()
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 4011, in _get_data
response = await send_recv(
File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 640, in send_recv
response = await comm.read(deserializers=deserializers)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 221, in read
convert_stream_closed_error(self, e)
File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 126, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:37070 remote=tcp://127.0.0.1:37667>: ConnectionResetError: [Errno 104] Connection reset by peer
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 33.46 GiB -- Worker memory limit: 46.57 GiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 33.46 GiB -- Worker memory limit: 46.57 GiB
distributed.worker - WARNING - Compute Failed
Function: unpack
args: (<dask.dataframe.shuffle_service.ShuffleService object at 0x7f4627812d00>, 240, None)
kwargs: {}
Exception: 'AttributeError("\'ShuffleService\' object has no attribute \'retrieve_futures\'")'
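Most of the worker-to-worker failures above are handshake/connection timeouts (`Errno 110`, `OSError: Timed out during handshake ... after 30 s`), which usually means the receiving worker was too busy (e.g. blocked on disk or compression work) to answer the connection attempt rather than actually dead. As a hedged sketch only, not a fix for the underlying shuffle issue, one mitigation is to raise the distributed comm timeouts before creating the cluster. The config keys below are real dask settings; the values are untuned guesses:

```python
import dask

# Give busy workers longer to answer connection handshakes (defaults are 30s).
dask.config.set({
    "distributed.comm.timeouts.connect": "120s",
    "distributed.comm.timeouts.tcp": "120s",
})
```

This only papers over the symptom: if workers stay unresponsive longer than the timeout, connections will still drop. The client-side traceback from the failed `to_parquet` call follows.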
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/tmp/ipykernel_815540/1044375382.py in <module>
----> 1 data.to_parquet('cache-imported.gz.parq', compression='gzip')
~/miniconda3/lib/python3.9/site-packages/dask/dataframe/core.py in to_parquet(self, path, *args, **kwargs)
4558 from .io import to_parquet
4559
-> 4560 return to_parquet(self, path, *args, **kwargs)
4561
4562 def to_orc(self, path, *args, **kwargs):
~/miniconda3/lib/python3.9/site-packages/dask/dataframe/io/parquet/core.py in to_parquet(df, path, engine, compression, write_index, append, overwrite, ignore_divisions, partition_on, storage_options, custom_metadata, write_metadata_file, compute, compute_kwargs, schema, **kwargs)
723 if compute:
724 if write_metadata_file:
--> 725 return compute_as_if_collection(
726 DataFrame, graph, (final_name, 0), **compute_kwargs
727 )
~/miniconda3/lib/python3.9/site-packages/dask/base.py in compute_as_if_collection(cls, dsk, keys, scheduler, get, **kwargs)
313 schedule = get_scheduler(scheduler=scheduler, cls=cls, get=get)
314 dsk2 = optimization_function(cls)(dsk, keys, **kwargs)
--> 315 return schedule(dsk2, keys, **kwargs)
316
317
~/miniconda3/lib/python3.9/site-packages/distributed/client.py in get(self, dsk, keys, workers, allow_other_workers, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
2691 should_rejoin = False
2692 try:
-> 2693 results = self.gather(packed, asynchronous=asynchronous, direct=direct)
2694 finally:
2695 for f in futures.values():
~/miniconda3/lib/python3.9/site-packages/distributed/client.py in gather(self, futures, errors, direct, asynchronous)
1967 else:
1968 local_worker = None
-> 1969 return self.sync(
1970 self._gather,
1971 futures,
~/miniconda3/lib/python3.9/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
863 return future
864 else:
--> 865 return sync(
866 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
867 )
~/miniconda3/lib/python3.9/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
325 if error[0]:
326 typ, exc, tb = error[0]
--> 327 raise exc.with_traceback(tb)
328 else:
329 return result[0]
~/miniconda3/lib/python3.9/site-packages/distributed/utils.py in f()
308 if callback_timeout is not None:
309 future = asyncio.wait_for(future, callback_timeout)
--> 310 result[0] = yield future
311 except Exception:
312 error[0] = sys.exc_info()
~/miniconda3/lib/python3.9/site-packages/tornado/gen.py in run(self)
760
761 try:
--> 762 value = future.result()
763 except Exception:
764 exc_info = sys.exc_info()
~/miniconda3/lib/python3.9/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
1832 exc = CancelledError(key)
1833 else:
-> 1834 raise exception.with_traceback(traceback)
1835 raise exc
1836 if errors == "skip":
~/miniconda3/lib/python3.9/site-packages/dask/dataframe/shuffle_service.py in unpack()
49 This is a task in the task graph
50 """
---> 51 return service.get(i)
52
53
~/miniconda3/lib/python3.9/site-packages/dask/dataframe/shuffle_service.py in get()
726 # ^ this causes the `process_received_shards` threads to flush any data out
727 # of `worker.shuffler`, then stop.
--> 728 concurrent.futures.wait(self.retrieve_futures)
729 self.retrieve_futures.clear()
730
AttributeError: 'ShuffleService' object has no attribute 'retrieve_futures'
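The `AttributeError` itself looks like `get()` reading `self.retrieve_futures` before any code path has created it (or after it has been torn down by the worker restarts shown above). The following is a hypothetical minimal sketch of that pattern, not the actual `ShuffleService` implementation, just to illustrate why the attribute can be missing and how a defensive default would avoid the crash:

```python
import concurrent.futures


class ShuffleLike:
    """Hypothetical illustration of the failure mode, not dask code."""

    def __init__(self):
        self._pool = concurrent.futures.ThreadPoolExecutor(2)
        # Bug pattern: retrieve_futures is NOT initialised here...

    def start_retrieving(self):
        # ...it only exists once retrieval has actually started.
        self.retrieve_futures = [self._pool.submit(lambda: None)]

    def get(self, i):
        # If get() runs before start_retrieving() -- e.g. after a worker
        # died and was replaced -- this raises the AttributeError in the log.
        concurrent.futures.wait(self.retrieve_futures)
        self.retrieve_futures.clear()
        return i


# Defensive alternatives: set `self.retrieve_futures = []` in __init__,
# or read it with `getattr(self, "retrieve_futures", [])`.
```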
Environment
- Dask version: '2021.09.1+26.gfd1b02b6' (#8223)
- Pandas: 1.3.1
- Python version: 3.9.5
- Operating System: Ubuntu 18.04.5 LTS
- Install method (conda, pip, source): conda
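Regarding the repeated "Unmanaged memory use is high" warnings: the page linked in the warning notes that on glibc-based Linux the memory may simply not have been returned to the OS yet and can be released manually. As an aside that does not address the shuffle failures themselves, trimming can be triggered on all workers as below. The `trim_memory` helper follows the linked docs and assumes the `client` object from the snippet above and glibc-based Linux workers:

```python
import ctypes


def trim_memory() -> int:
    # Ask glibc to hand freed arenas back to the OS (Linux/glibc only).
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)


# Run on every worker in the cluster.
client.run(trim_memory)
```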
This blogpost might also be interesting if you haven’t had a chance to read it: https://coiled.io/blog/better-shuffling-in-dask-a-proof-of-concept/
Hi @DahnJ
You might have already seen this related discussion about how the shuffle service could potentially be redesigned to be more robust. Link to document.
It’s great to see people trying this stuff out.