Concurrent os.makedirs for dask-worker-space can lead to worker failure
What happened:
This bug occurs intermittently. It is a race condition: several worker processes each try to create the dask-worker-space
folder when it does not yet exist. I only hit it about once in every 100 scripts I launch.
If you run:
from dask.distributed import Client
client = Client(memory_limit=0, n_workers=Environment.DASK_PROCESSES)
many times with a high enough value for Environment.DASK_PROCESSES
(16+), the following error will eventually occur:
Process Dask Worker process (from Nanny):
Traceback (most recent call last):
File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/site-packages/distributed/process.py", line 191, in _run
target(*args, **kwargs)
File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/site-packages/distributed/nanny.py", line 728, in _run
worker = Worker(**worker_kwargs)
File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/site-packages/distributed/worker.py", line 489, in __init__
os.makedirs(local_directory)
File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/os.py", line 223, in makedirs
mkdir(name, mode)
FileExistsError: [Errno 17] File exists: '/dask-worker-space'
distributed.utils - ERROR - addresses should be strings or tuples, got None
Traceback (most recent call last):
File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/site-packages/distributed/utils.py", line 656, in log_errors
yield
File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/site-packages/distributed/scheduler.py", line 2208, in remove_worker
address = self.coerce_address(address)
File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/site-packages/distributed/scheduler.py", line 4946, in coerce_address
raise TypeError("addresses should be strings or tuples, got %r" % (addr,))
TypeError: addresses should be strings or tuples, got None
distributed.core - ERROR - addresses should be strings or tuples, got None
Traceback (most recent call last):
File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/site-packages/distributed/core.py", line 513, in handle_comm
result = await result
File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/site-packages/distributed/scheduler.py", line 2208, in remove_worker
address = self.coerce_address(address)
File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/site-packages/distributed/scheduler.py", line 4946, in coerce_address
raise TypeError("addresses should be strings or tuples, got %r" % (addr,))
TypeError: addresses should be strings or tuples, got None
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7ff497641f10>>, <Task finished name='Task-14' coro=<Nanny._on_exit() done, defined at /home/user/.pyenv/versions/3.8.2/lib/python3.8/site-packages/distributed/nanny.py:440> exception=TypeError('addresses should be strings or tuples, got None')>)
Traceback (most recent call last):
File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/site-packages/tornado/ioloop.py", line 743, in _run_callback
ret = callback()
File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
future.result()
File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/site-packages/distributed/nanny.py", line 443, in _on_exit
await self.scheduler.unregister(address=self.worker_address)
File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/site-packages/distributed/core.py", line 861, in send_recv_from_rpc
result = await send_recv(comm=comm, op=key, **kwargs)
File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/site-packages/distributed/core.py", line 660, in send_recv
raise exc.with_traceback(tb)
File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/site-packages/distributed/core.py", line 513, in handle_comm
result = await result
File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/site-packages/distributed/scheduler.py", line 2208, in remove_worker
address = self.coerce_address(address)
File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/site-packages/distributed/scheduler.py", line 4946, in coerce_address
raise TypeError("addresses should be strings or tuples, got %r" % (addr,))
TypeError: addresses should be strings or tuples, got None
The error seems to come from this line in worker.py:
https://github.com/dask/distributed/blob/c67705f3f513de5bc09b897c400011b543ff0f7c/distributed/worker.py#L489
Between the existence check in the if statement and the call that creates the folder, another process might create the folder,
in which case os.makedirs fails with FileExistsError.
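To make the window between the check and the create visible, here is a minimal sketch that forces the race deterministically. It is not the code from worker.py: racy_makedirs and its between_check_and_create hook are hypothetical names used only for illustration, and the hook stands in for a second worker process that creates the folder first.

```python
import os
import tempfile


def racy_makedirs(path, between_check_and_create=None):
    """Check-then-create pattern (hypothetical helper, not Dask code)."""
    if not os.path.exists(path):
        # Illustration only: simulate another process winning the race here.
        if between_check_and_create is not None:
            between_check_and_create()
        os.makedirs(path)  # raises if the folder appeared after the check


with tempfile.TemporaryDirectory() as base:
    target = os.path.join(base, "dask-worker-space")
    try:
        racy_makedirs(target, between_check_and_create=lambda: os.mkdir(target))
        outcome = "created"
    except FileExistsError:
        outcome = "FileExistsError"
    print(outcome)
```

In a real run the window is tiny, which would explain why the failure only shows up occasionally and only with many workers starting at once.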
What you expected to happen:
I would expect the code to simply continue if the folder already exists, for example by calling os.makedirs(folder, exist_ok=True).
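As a rough sketch of the behaviour I would expect, the snippet below lets 16 callers create the same folder at the same moment with exist_ok=True. It uses threads as a stand-in for the worker processes to keep the example self-contained, and create_worker_space and N_WORKERS are made-up names, not part of distributed.

```python
import os
import tempfile
import threading

N_WORKERS = 16  # illustrative value, matching the "16+" mentioned above


def create_worker_space(path, barrier, errors):
    """Hypothetical stand-in for a worker creating its local directory."""
    barrier.wait()  # release all callers at once to maximise contention
    try:
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    except OSError as exc:
        errors.append(exc)


with tempfile.TemporaryDirectory() as base:
    target = os.path.join(base, "dask-worker-space")
    barrier = threading.Barrier(N_WORKERS)
    errors = []
    threads = [
        threading.Thread(target=create_worker_space, args=(target, barrier, errors))
        for _ in range(N_WORKERS)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    created = os.path.isdir(target)
    print(created, len(errors))
```

With exist_ok=True the callers that lose the race see that the folder is already a directory and carry on, so no existence check is needed beforehand.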
Minimal Complete Verifiable Example: The following triggers the error intermittently (with Environment.DASK_PROCESSES set to 16 or more).
from dask.distributed import Client
client = Client(memory_limit=0, n_workers=Environment.DASK_PROCESSES)
Anything else we need to know?: I am happy to submit a PR to fix this (and yes it would be tiny 😉 ).
Environment:
- Dask version: 2.19.0
- Python version: 3.8.2
- Operating System: docker python:3.8 image
- Install method (conda, pip, source): pip (pipenv)
Top GitHub Comments
That is good to be closed 👍 Thanks for merging the PR 🙂
Thanks for the PR 😄