
Shuffle prototype: Feedback (disk usage + workers dying)

See original GitHub issue

I have tried #8223 on a ~3.4TB gzipped Parquet dataset.

I have tried four runs so far, with two different behaviours:

  • First I tried the whole dataset. I got to the last step (to_parquet), but then ran out of disk space, with over 4TB being used by the shuffle.
  • In the subsequent three runs I tried a subset (by passing a smaller number of files to read_parquet). I did not get past the set_index step. Workers seem to die and the computation hangs, always towards the end. Errors are below.

Based on the first run, this does look like it could’ve actually been successful if I had more disk space. That’s quite exciting, as external sorting has been a big issue for me.
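
For the disk-space problem, the next thing I plan to try is pointing the workers at a larger scratch volume. A minimal sketch, assuming the shuffle service spills under each worker's local_directory (the path below is just a placeholder for a bigger disk):

from dask.distributed import Client, LocalCluster

# Same cluster as below, but with worker scratch space on a larger volume.
cluster = LocalCluster(
    n_workers=10,
    memory_limit='25GB',
    threads_per_worker=2,
    local_directory='/mnt/big-scratch/dask-worker-space',  # placeholder path
)
client = Client(cluster)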

Code used

Sadly I cannot share the data, but I can at least share the code I'm using.

from dask.distributed import Client, LocalCluster
import dask.dataframe as dd
from pathlib import Path

cluster = LocalCluster(n_workers=10, memory_limit='25GB', threads_per_worker=2)
client = Client(cluster)
paths_files = list(Path('.').glob('*/*.parquet'))
columns = ['col_a', 'col_b']

data = dd.read_parquet(paths_files, columns=columns)

# Here is where workers fail and computation hangs (3 out of 4 tries with a subset)
data = data.set_index('col_a', shuffle="service")
# Here is where it ran out of disk space (1 out of 4 tries)
data.to_parquet('cache-imported.gz.parq', compression='gzip')
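
For the subset runs mentioned above, the only change was passing fewer paths to read_parquet; a minimal sketch (the slice size is arbitrary):

# Same pipeline, but only read a slice of the file list.
paths_subset = paths_files[:500]  # arbitrary subset size
data = dd.read_parquet(paths_subset, columns=columns)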

Errors

Original error
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:38001
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 205, in read
    frames_nbytes = await stream.read_bytes(fmt_size)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 492, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 319, in connect
    handshake = await asyncio.wait_for(comm.read(), time_left())
  File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 494, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2654, in gather_dep
    response = await get_data_from_worker(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3982, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3959, in _get_data
    comm = await rpc.connect(worker)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1048, in connect
    raise exc
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1032, in connect
    comm = await fut
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 324, in connect
    raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:38001 after 30 s
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:38215
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 245, in write
    async def write(self, msg, serializers=None, on_error="message"):
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 452, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 320, in connect
    await asyncio.wait_for(comm.write(local_info), time_left())
  File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 454, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2654, in gather_dep
    response = await get_data_from_worker(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3982, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3959, in _get_data
    comm = await rpc.connect(worker)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1048, in connect
    raise exc
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1032, in connect
    comm = await fut
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 324, in connect
    raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:38215 after 30 s
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:39419
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 245, in write
    async def write(self, msg, serializers=None, on_error="message"):
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 452, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 320, in connect
    await asyncio.wait_for(comm.write(local_info), time_left())
  File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 454, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2654, in gather_dep
    response = await get_data_from_worker(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3982, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3959, in _get_data
    comm = await rpc.connect(worker)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1048, in connect
    raise exc
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1032, in connect
    comm = await fut
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 324, in connect
    raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:39419 after 30 s
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:42233
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 245, in write
    async def write(self, msg, serializers=None, on_error="message"):
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 452, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 320, in connect
    await asyncio.wait_for(comm.write(local_info), time_left())
  File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 454, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2654, in gather_dep
    response = await get_data_from_worker(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3982, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3959, in _get_data
    comm = await rpc.connect(worker)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1048, in connect
    raise exc
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1032, in connect
    comm = await fut
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 324, in connect
    raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:42233 after 30 s
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:33333
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 245, in write
    async def write(self, msg, serializers=None, on_error="message"):
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 452, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 320, in connect
    await asyncio.wait_for(comm.write(local_info), time_left())
  File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 454, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2654, in gather_dep
    response = await get_data_from_worker(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3982, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3959, in _get_data
    comm = await rpc.connect(worker)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1048, in connect
    raise exc
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1032, in connect
    comm = await fut
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 324, in connect
    raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:33333 after 30 s
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:38215
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 245, in write
    async def write(self, msg, serializers=None, on_error="message"):
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 452, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 320, in connect
    await asyncio.wait_for(comm.write(local_info), time_left())
  File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 454, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2654, in gather_dep
    response = await get_data_from_worker(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3982, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3959, in _get_data
    comm = await rpc.connect(worker)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1048, in connect
    raise exc
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1032, in connect
    comm = await fut
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 324, in connect
    raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:38215 after 30 s
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:33333
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 245, in write
    async def write(self, msg, serializers=None, on_error="message"):
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 452, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 320, in connect
    await asyncio.wait_for(comm.write(local_info), time_left())
  File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 454, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2654, in gather_dep
    response = await get_data_from_worker(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3982, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3959, in _get_data
    comm = await rpc.connect(worker)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1048, in connect
    raise exc
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1032, in connect
    comm = await fut
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 324, in connect
    raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:33333 after 30 s
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:33333
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 245, in write
    async def write(self, msg, serializers=None, on_error="message"):
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 452, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 320, in connect
    await asyncio.wait_for(comm.write(local_info), time_left())
  File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 454, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2654, in gather_dep
    response = await get_data_from_worker(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3982, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3959, in _get_data
    comm = await rpc.connect(worker)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1048, in connect
    raise exc
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1032, in connect
    comm = await fut
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 324, in connect
    raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:33333 after 30 s
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:36893
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 205, in read
    frames_nbytes = await stream.read_bytes(fmt_size)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 492, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 319, in connect
    handshake = await asyncio.wait_for(comm.read(), time_left())
  File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 494, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2654, in gather_dep
    response = await get_data_from_worker(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3982, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3959, in _get_data
    comm = await rpc.connect(worker)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1048, in connect
    raise exc
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1032, in connect
    comm = await fut
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 324, in connect
    raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:36893 after 30 s
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:41795
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 205, in read
    frames_nbytes = await stream.read_bytes(fmt_size)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 492, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 319, in connect
    handshake = await asyncio.wait_for(comm.read(), time_left())
  File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 494, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2654, in gather_dep
    response = await get_data_from_worker(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3982, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3959, in _get_data
    comm = await rpc.connect(worker)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1048, in connect
    raise exc
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1032, in connect
    comm = await fut
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 324, in connect
    raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:41795 after 30 s
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:41795
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 245, in write
    async def write(self, msg, serializers=None, on_error="message"):
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 452, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 320, in connect
    await asyncio.wait_for(comm.write(local_info), time_left())
  File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 454, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2654, in gather_dep
    response = await get_data_from_worker(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3982, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3959, in _get_data
    comm = await rpc.connect(worker)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1048, in connect
    raise exc
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1032, in connect
    comm = await fut
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 324, in connect
    raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:41795 after 30 s
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:41795
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 245, in write
    async def write(self, msg, serializers=None, on_error="message"):
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 452, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 320, in connect
    await asyncio.wait_for(comm.write(local_info), time_left())
  File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 454, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2654, in gather_dep
    response = await get_data_from_worker(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3982, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3959, in _get_data
    comm = await rpc.connect(worker)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1048, in connect
    raise exc
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1032, in connect
    comm = await fut
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 324, in connect
    raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:41795 after 30 s
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:38707
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 205, in read
    frames_nbytes = await stream.read_bytes(fmt_size)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 492, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 319, in connect
    handshake = await asyncio.wait_for(comm.read(), time_left())
  File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 494, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2654, in gather_dep
    response = await get_data_from_worker(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3982, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3959, in _get_data
    comm = await rpc.connect(worker)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1048, in connect
    raise exc
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1032, in connect
    comm = await fut
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 324, in connect
    raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:38707 after 30 s
distributed.utils - ERROR - ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", [("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'register-replica', 'released', 'compute-task-1635104458.1960478', 1635104468.872155), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'released', 'fetch', {}, 'compute-task-1635104458.1960478', 1635104468.8722844), ('gather-dependencies', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'stimulus', 1635104468.872853), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'fetch', 'flight', {}, 'ensure-communicating-1635104468.8727045', 1635104468.8728676), ('request-dep', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104468.8727045', 1635104470.384826), ('receive-dep-failed', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104468.8727045', 1635104503.914738), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'missing-dep'), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'flight', 'fetch', {}, 'ensure-communicating-1635104468.8727045', 1635104503.9147604), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'fetch', 'missing', {}, 'ensure-communicating-1635104503.91477', 1635104503.9147818), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'missing', 'fetch', {}, 'find-missing-1635104504.5391095', 1635104504.5398495), ('gather-dependencies', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'stimulus', 1635104504.5398893), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'fetch', 'flight', {}, 'ensure-communicating-1635104504.5398538', 1635104504.5398993), ('request-dep', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104504.5398538', 1635104504.5399654), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'flight', 'cancelled', {}, 'processing-released-1635104504.5400004', 1635104504.5406826), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'cancelled', 'cancelled', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)": 'released'}, 'processing-released-1635104504.5400004', 1635104504.5406876), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'flight', 'cancelled', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)": 'released'}, 'processing-released-1635104504.5400004', 1635104504.540689), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'release-key', 'processing-released-1635104504.5400004'), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'cancelled', 'released', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)": 'forgotten'}, 'processing-released-1635104504.5400004', 1635104504.5407002), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'released', 'forgotten', {}, 'processing-released-1635104504.5400004', 1635104504.5407038), ('lost-during-gather', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104504.5398538'), ('receive-dep', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104504.5398538', 1635104548.287821)])
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils.py", line 648, in log_errors
    yield
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2745, in gather_dep
    assert ts, (d, self.story(d))
AssertionError: ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", [("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'register-replica', 'released', 'compute-task-1635104458.1960478', 1635104468.872155), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'released', 'fetch', {}, 'compute-task-1635104458.1960478', 1635104468.8722844), ('gather-dependencies', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'stimulus', 1635104468.872853), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'fetch', 'flight', {}, 'ensure-communicating-1635104468.8727045', 1635104468.8728676), ('request-dep', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104468.8727045', 1635104470.384826), ('receive-dep-failed', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104468.8727045', 1635104503.914738), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'missing-dep'), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'flight', 'fetch', {}, 'ensure-communicating-1635104468.8727045', 1635104503.9147604), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'fetch', 'missing', {}, 'ensure-communicating-1635104503.91477', 1635104503.9147818), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'missing', 'fetch', {}, 'find-missing-1635104504.5391095', 1635104504.5398495), ('gather-dependencies', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'stimulus', 1635104504.5398893), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'fetch', 'flight', {}, 'ensure-communicating-1635104504.5398538', 1635104504.5398993), ('request-dep', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104504.5398538', 1635104504.5399654), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'flight', 'cancelled', {}, 'processing-released-1635104504.5400004', 1635104504.5406826), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'cancelled', 'cancelled', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)": 'released'}, 'processing-released-1635104504.5400004', 1635104504.5406876), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'flight', 'cancelled', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)": 'released'}, 'processing-released-1635104504.5400004', 1635104504.540689), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'release-key', 'processing-released-1635104504.5400004'), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'cancelled', 'released', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)": 'forgotten'}, 'processing-released-1635104504.5400004', 1635104504.5407002), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'released', 'forgotten', {}, 'processing-released-1635104504.5400004', 1635104504.5407038), ('lost-during-gather', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104504.5398538'), ('receive-dep', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104504.5398538', 1635104548.287821)])
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f5f45925ac0>>, <Task finished name='Task-8548' coro=<Worker.gather_dep() done, defined at /root/miniconda3/lib/python3.9/site-packages/distributed/worker.py:2588> exception=AssertionError(("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", [("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'register-replica', 'released', 'compute-task-1635104458.1960478', 1635104468.872155), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'released', 'fetch', {}, 'compute-task-1635104458.1960478', 1635104468.8722844), ('gather-dependencies', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'stimulus', 1635104468.872853), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'fetch', 'flight', {}, 'ensure-communicating-1635104468.8727045', 1635104468.8728676), ('request-dep', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104468.8727045', 1635104470.384826), ('receive-dep-failed', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104468.8727045', 1635104503.914738), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'missing-dep'), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'flight', 'fetch', {}, 'ensure-communicating-1635104468.8727045', 1635104503.9147604), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'fetch', 'missing', {}, 'ensure-communicating-1635104503.91477', 1635104503.9147818), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'missing', 'fetch', {}, 'find-missing-1635104504.5391095', 1635104504.5398495), ('gather-dependencies', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'stimulus', 1635104504.5398893), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'fetch', 'flight', {}, 'ensure-communicating-1635104504.5398538', 1635104504.5398993), ('request-dep', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104504.5398538', 1635104504.5399654), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'flight', 'cancelled', {}, 'processing-released-1635104504.5400004', 1635104504.5406826), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'cancelled', 'cancelled', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)": 'released'}, 'processing-released-1635104504.5400004', 1635104504.5406876), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'flight', 'cancelled', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)": 'released'}, 'processing-released-1635104504.5400004', 1635104504.540689), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'release-key', 'processing-released-1635104504.5400004'), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'cancelled', 'released', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)": 'forgotten'}, 'processing-released-1635104504.5400004', 1635104504.5407002), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'released', 'forgotten', {}, 'processing-released-1635104504.5400004', 1635104504.5407038), ('lost-during-gather', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104504.5398538'), ('receive-dep', 'tcp://127.0.0.1:41795', 
{"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104504.5398538', 1635104548.287821)]))>)
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/tornado/ioloop.py", line 741, in _run_callback
    ret = callback()
  File "/root/miniconda3/lib/python3.9/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
    future.result()
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2745, in gather_dep
    assert ts, (d, self.story(d))
AssertionError: ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", [("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'register-replica', 'released', 'compute-task-1635104458.1960478', 1635104468.872155), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'released', 'fetch', {}, 'compute-task-1635104458.1960478', 1635104468.8722844), ('gather-dependencies', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'stimulus', 1635104468.872853), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'fetch', 'flight', {}, 'ensure-communicating-1635104468.8727045', 1635104468.8728676), ('request-dep', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104468.8727045', 1635104470.384826), ('receive-dep-failed', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104468.8727045', 1635104503.914738), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'missing-dep'), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'flight', 'fetch', {}, 'ensure-communicating-1635104468.8727045', 1635104503.9147604), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'fetch', 'missing', {}, 'ensure-communicating-1635104503.91477', 1635104503.9147818), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'missing', 'fetch', {}, 'find-missing-1635104504.5391095', 1635104504.5398495), ('gather-dependencies', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'stimulus', 1635104504.5398893), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'fetch', 'flight', {}, 'ensure-communicating-1635104504.5398538', 1635104504.5398993), ('request-dep', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104504.5398538', 1635104504.5399654), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'flight', 'cancelled', {}, 'processing-released-1635104504.5400004', 1635104504.5406826), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'cancelled', 'cancelled', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)": 'released'}, 'processing-released-1635104504.5400004', 1635104504.5406876), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'flight', 'cancelled', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)": 'released'}, 'processing-released-1635104504.5400004', 1635104504.540689), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'release-key', 'processing-released-1635104504.5400004'), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'cancelled', 'released', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)": 'forgotten'}, 'processing-released-1635104504.5400004', 1635104504.5407002), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'released', 'forgotten', {}, 'processing-released-1635104504.5400004', 1635104504.5407038), ('lost-during-gather', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104504.5398538'), ('receive-dep', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104504.5398538', 1635104548.287821)])
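
The repeated `OSError: Timed out during handshake while connecting ... after 30 s` lines match the default `distributed.comm.timeouts.connect` of 30 s. A minimal sketch of raising that timeout before creating the cluster, purely to rule out slow handshakes while workers are busy spilling (an assumption on my side, not a fix for the assertion error above):

import dask

# Give workers more time to complete the comm handshake (default is 30s).
dask.config.set({"distributed.comm.timeouts.connect": "60s"})
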
Another error during `to_parquet`
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:37667
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2741, in gather_dep
    response = await get_data_from_worker(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 4031, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 4011, in _get_data
    response = await send_recv(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 640, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 221, in read
    convert_stream_closed_error(self, e)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 126, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:37530 remote=tcp://127.0.0.1:37667>: ConnectionResetError: [Errno 104] Connection reset by peer
distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:37667 -> tcp://127.0.0.1:34511
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
TimeoutError: [Errno 110] Connection timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 1649, in get_data
    response = await comm.read(deserializers=serializers)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 221, in read
    convert_stream_closed_error(self, e)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 126, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed)  local=tcp://127.0.0.1:37667 remote=tcp://127.0.0.1:37530>: TimeoutError: [Errno 110] Connection timed out
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:36281
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2741, in gather_dep
    response = await get_data_from_worker(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 4031, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 4011, in _get_data
    response = await send_recv(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 640, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 221, in read
    convert_stream_closed_error(self, e)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 126, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:45000 remote=tcp://127.0.0.1:36281>: ConnectionResetError: [Errno 104] Connection reset by peer
distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:36281 -> tcp://127.0.0.1:34511
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
TimeoutError: [Errno 110] Connection timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 1649, in get_data
    response = await comm.read(deserializers=serializers)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 221, in read
    convert_stream_closed_error(self, e)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 126, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed)  local=tcp://127.0.0.1:36281 remote=tcp://127.0.0.1:45000>: TimeoutError: [Errno 110] Connection timed out
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f348fe7bac0>>, <Task finished name='Task-9116' coro=<Worker.heartbeat() done, defined at /root/miniconda3/lib/python3.9/site-packages/distributed/worker.py:1182> exception=OSError('Timed out during handshake while connecting to tcp://127.0.0.1:43431 after 30 s')>)
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 245, in write
    async def write(self, msg, serializers=None, on_error="message"):
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 452, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 320, in connect
    await asyncio.wait_for(comm.write(local_info), time_left())
  File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 454, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/tornado/ioloop.py", line 741, in _run_callback
    ret = callback()
  File "/root/miniconda3/lib/python3.9/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
    future.result()
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 1231, in heartbeat
    raise e
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 1190, in heartbeat
    response = await retry_operation(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 860, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1048, in connect
    raise exc
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1032, in connect
    comm = await fut
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 324, in connect
    raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:43431 after 30 s
distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:36281 -> tcp://127.0.0.1:37667
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
TimeoutError: [Errno 110] Connection timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 1649, in get_data
    response = await comm.read(deserializers=serializers)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 221, in read
    convert_stream_closed_error(self, e)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 126, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed)  local=tcp://127.0.0.1:36281 remote=tcp://127.0.0.1:44898>: TimeoutError: [Errno 110] Connection timed out
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:36281
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2741, in gather_dep
    response = await get_data_from_worker(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 4031, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 4011, in _get_data
    response = await send_recv(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 640, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 221, in read
    convert_stream_closed_error(self, e)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 126, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:44898 remote=tcp://127.0.0.1:36281>: ConnectionResetError: [Errno 104] Connection reset by peer
distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:43515 -> tcp://127.0.0.1:37275
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
TimeoutError: [Errno 110] Connection timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 1649, in get_data
    response = await comm.read(deserializers=serializers)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 221, in read
    convert_stream_closed_error(self, e)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 126, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed)  local=tcp://127.0.0.1:43515 remote=tcp://127.0.0.1:45898>: TimeoutError: [Errno 110] Connection timed out
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:43515
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2741, in gather_dep
    response = await get_data_from_worker(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 4031, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 4011, in _get_data
    response = await send_recv(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 640, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 221, in read
    convert_stream_closed_error(self, e)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 126, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:45898 remote=tcp://127.0.0.1:43515>: ConnectionResetError: [Errno 104] Connection reset by peer
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:37275
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2741, in gather_dep
    response = await get_data_from_worker(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 4031, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 4011, in _get_data
    response = await send_recv(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 640, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 221, in read
    convert_stream_closed_error(self, e)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 126, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:40650 remote=tcp://127.0.0.1:37275>: ConnectionResetError: [Errno 104] Connection reset by peer
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:43515
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2741, in gather_dep
    response = await get_data_from_worker(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 4031, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 4011, in _get_data
    response = await send_recv(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 640, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 221, in read
    convert_stream_closed_error(self, e)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 126, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:46346 remote=tcp://127.0.0.1:43515>: ConnectionResetError: [Errno 104] Connection reset by peer
distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:43515 -> tcp://127.0.0.1:37667
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
TimeoutError: [Errno 110] Connection timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 1649, in get_data
    response = await comm.read(deserializers=serializers)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 221, in read
    convert_stream_closed_error(self, e)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 126, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed)  local=tcp://127.0.0.1:43515 remote=tcp://127.0.0.1:46346>: TimeoutError: [Errno 110] Connection timed out
distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:37275 -> tcp://127.0.0.1:34511
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
TimeoutError: [Errno 110] Connection timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 1649, in get_data
    response = await comm.read(deserializers=serializers)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 221, in read
    convert_stream_closed_error(self, e)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 126, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed)  local=tcp://127.0.0.1:37275 remote=tcp://127.0.0.1:40650>: TimeoutError: [Errno 110] Connection timed out
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 29.25 GiB -- Worker memory limit: 46.57 GiB
distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:37667 -> tcp://127.0.0.1:36281
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
TimeoutError: [Errno 110] Connection timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 1649, in get_data
    response = await comm.read(deserializers=serializers)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 221, in read
    convert_stream_closed_error(self, e)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 126, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed)  local=tcp://127.0.0.1:37667 remote=tcp://127.0.0.1:37070>: TimeoutError: [Errno 110] Connection timed out
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:37667
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/root/miniconda3/lib/python3.9/site-packages/tornado/iostream.py", line 1140, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2741, in gather_dep
    response = await get_data_from_worker(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 4031, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 4011, in _get_data
    response = await send_recv(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 640, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 221, in read
    convert_stream_closed_error(self, e)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 126, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:37070 remote=tcp://127.0.0.1:37667>: ConnectionResetError: [Errno 104] Connection reset by peer
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 33.46 GiB -- Worker memory limit: 46.57 GiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 33.46 GiB -- Worker memory limit: 46.57 GiB
distributed.worker - WARNING - Compute Failed
Function:  unpack
args:      (<dask.dataframe.shuffle_service.ShuffleService object at 0x7f4627812d00>, 240, None)
kwargs:    {}
Exception: 'AttributeError("\'ShuffleService\' object has no attribute \'retrieve_futures\'")'

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_815540/1044375382.py in <module>
----> 1 data.to_parquet('cache-imported.gz.parq', compression='gzip')

~/miniconda3/lib/python3.9/site-packages/dask/dataframe/core.py in to_parquet(self, path, *args, **kwargs)
   4558         from .io import to_parquet
   4559 
-> 4560         return to_parquet(self, path, *args, **kwargs)
   4561 
   4562     def to_orc(self, path, *args, **kwargs):

~/miniconda3/lib/python3.9/site-packages/dask/dataframe/io/parquet/core.py in to_parquet(df, path, engine, compression, write_index, append, overwrite, ignore_divisions, partition_on, storage_options, custom_metadata, write_metadata_file, compute, compute_kwargs, schema, **kwargs)
    723     if compute:
    724         if write_metadata_file:
--> 725             return compute_as_if_collection(
    726                 DataFrame, graph, (final_name, 0), **compute_kwargs
    727             )

~/miniconda3/lib/python3.9/site-packages/dask/base.py in compute_as_if_collection(cls, dsk, keys, scheduler, get, **kwargs)
    313     schedule = get_scheduler(scheduler=scheduler, cls=cls, get=get)
    314     dsk2 = optimization_function(cls)(dsk, keys, **kwargs)
--> 315     return schedule(dsk2, keys, **kwargs)
    316 
    317 

~/miniconda3/lib/python3.9/site-packages/distributed/client.py in get(self, dsk, keys, workers, allow_other_workers, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
   2691                     should_rejoin = False
   2692             try:
-> 2693                 results = self.gather(packed, asynchronous=asynchronous, direct=direct)
   2694             finally:
   2695                 for f in futures.values():

~/miniconda3/lib/python3.9/site-packages/distributed/client.py in gather(self, futures, errors, direct, asynchronous)
   1967             else:
   1968                 local_worker = None
-> 1969             return self.sync(
   1970                 self._gather,
   1971                 futures,

~/miniconda3/lib/python3.9/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    863             return future
    864         else:
--> 865             return sync(
    866                 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    867             )

~/miniconda3/lib/python3.9/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    325     if error[0]:
    326         typ, exc, tb = error[0]
--> 327         raise exc.with_traceback(tb)
    328     else:
    329         return result[0]

~/miniconda3/lib/python3.9/site-packages/distributed/utils.py in f()
    308             if callback_timeout is not None:
    309                 future = asyncio.wait_for(future, callback_timeout)
--> 310             result[0] = yield future
    311         except Exception:
    312             error[0] = sys.exc_info()

~/miniconda3/lib/python3.9/site-packages/tornado/gen.py in run(self)
    760 
    761                     try:
--> 762                         value = future.result()
    763                     except Exception:
    764                         exc_info = sys.exc_info()

~/miniconda3/lib/python3.9/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
   1832                             exc = CancelledError(key)
   1833                         else:
-> 1834                             raise exception.with_traceback(traceback)
   1835                         raise exc
   1836                     if errors == "skip":

~/miniconda3/lib/python3.9/site-packages/dask/dataframe/shuffle_service.py in unpack()
     49     This is a task in the task graph
     50     """
---> 51     return service.get(i)
     52 
     53 

~/miniconda3/lib/python3.9/site-packages/dask/dataframe/shuffle_service.py in get()
    726                 # ^ this causes the `process_received_shards` threads to flush any data out
    727                 # of `worker.shuffler`, then stop.
--> 728                 concurrent.futures.wait(self.retrieve_futures)
    729                 self.retrieve_futures.clear()
    730 

AttributeError: 'ShuffleService' object has no attribute 'retrieve_futures'
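
For comparison (not one of the four runs described above): while #8223 is being debugged, a possible stop-gap for the external-sorting use case would be Dask’s existing shuffle methods. A minimal sketch, assuming the `data` frame from the “Code used” snippet above; the "disk" method is applicable here only because this is a single-machine LocalCluster:

# Hypothetical fallback, not part of the original runs: use the built-in
# disk-based shuffle instead of the prototype shuffle service.
data = data.set_index('col_a', shuffle="disk")

The repeated unmanaged-memory warnings link to https://distributed.dask.org/en/latest/worker.html#memtrim, which (on glibc-based systems) suggests lowering MALLOC_TRIM_THRESHOLD_ for the worker processes — e.g. setting MALLOC_TRIM_THRESHOLD_=65536 in the environment before the cluster starts — so that freed memory is returned to the OS more promptly.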

Environment

  • Dask version: '2021.09.1+26.gfd1b02b6' (#8223)
  • Pandas: 1.3.1
  • Python version: 3.9.5
  • Operating System: Ubuntu 18.04.5 LTS
  • Install method (conda, pip, source): conda

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
GenevieveBuckley commented, Oct 28, 2021

Also, this blog post might be interesting if you haven’t had a chance to read it: https://coiled.io/blog/better-shuffling-in-dask-a-proof-of-concept/

1 reaction
GenevieveBuckley commented, Oct 28, 2021

Hi @DahnJ

You might have already seen this related discussion about how the shuffle service could potentially be redesigned to be more robust. Link to document.

It’s great to see people trying this stuff out.
