Concurrent os.makedirs for dask-worker-space can lead to worker failure

What happened:

This bug occurs intermittently: it is raised when multiple worker processes try to create the dask-worker-space folder at the same time because it does not exist yet. I hit it roughly once in every 100 scripts I launch. If you run:

from dask.distributed import Client
client = Client(memory_limit=0, n_workers=Environment.DASK_PROCESSES)

many times with a sufficiently high value for Environment.DASK_PROCESSES (16 or more), the following error will eventually occur.

Process Dask Worker process (from Nanny):
Traceback (most recent call last):
  File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/site-packages/distributed/process.py", line 191, in _run
    target(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/site-packages/distributed/nanny.py", line 728, in _run
    worker = Worker(**worker_kwargs)
  File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/site-packages/distributed/worker.py", line 489, in __init__
    os.makedirs(local_directory)
  File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/os.py", line 223, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: '/dask-worker-space'
distributed.utils - ERROR - addresses should be strings or tuples, got None
Traceback (most recent call last):
  File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/site-packages/distributed/utils.py", line 656, in log_errors
    yield
  File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/site-packages/distributed/scheduler.py", line 2208, in remove_worker
    address = self.coerce_address(address)
  File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/site-packages/distributed/scheduler.py", line 4946, in coerce_address
    raise TypeError("addresses should be strings or tuples, got %r" % (addr,))
TypeError: addresses should be strings or tuples, got None
distributed.core - ERROR - addresses should be strings or tuples, got None
Traceback (most recent call last):
  File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/site-packages/distributed/core.py", line 513, in handle_comm
    result = await result
  File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/site-packages/distributed/scheduler.py", line 2208, in remove_worker
    address = self.coerce_address(address)
  File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/site-packages/distributed/scheduler.py", line 4946, in coerce_address
    raise TypeError("addresses should be strings or tuples, got %r" % (addr,))
TypeError: addresses should be strings or tuples, got None
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7ff497641f10>>, <Task finished name='Task-14' coro=<Nanny._on_exit() done, defined at /home/user/.pyenv/versions/3.8.2/lib/python3.8/site-packages/distributed/nanny.py:440> exception=TypeError('addresses should be strings or tuples, got None')>)
Traceback (most recent call last):
  File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
  File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
    future.result()
  File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/site-packages/distributed/nanny.py", line 443, in _on_exit
    await self.scheduler.unregister(address=self.worker_address)
  File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/site-packages/distributed/core.py", line 861, in send_recv_from_rpc
    result = await send_recv(comm=comm, op=key, **kwargs)
  File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/site-packages/distributed/core.py", line 660, in send_recv
    raise exc.with_traceback(tb)
  File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/site-packages/distributed/core.py", line 513, in handle_comm
    result = await result
  File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/site-packages/distributed/scheduler.py", line 2208, in remove_worker
    address = self.coerce_address(address)
  File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/site-packages/distributed/scheduler.py", line 4946, in coerce_address
    raise TypeError("addresses should be strings or tuples, got %r" % (addr,))
TypeError: addresses should be strings or tuples, got None

The error seems to come from this line in worker.py: https://github.com/dask/distributed/blob/c67705f3f513de5bc09b897c400011b543ff0f7c/distributed/worker.py#L489

Between the existence check in the if statement and the call that creates the folder, another process may create the folder first, in which case os.makedirs fails with FileExistsError.
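The race can be reproduced deterministically in a small sketch. This is not the code from worker.py; it assumes a check-then-create pattern like the one described above, and the helper racy_makedirs and its between_check_and_create hook are hypothetical names used only to simulate a second process creating the folder at the worst possible moment.

```python
import os
import tempfile


def racy_makedirs(path, between_check_and_create=None):
    """Check-then-create pattern (hypothetical helper, not dask code).

    The optional hook runs after the existence check and before makedirs,
    standing in for another process that wins the race.
    """
    if not os.path.exists(path):
        if between_check_and_create is not None:
            between_check_and_create()
        os.makedirs(path)


with tempfile.TemporaryDirectory() as base:
    target = os.path.join(base, "dask-worker-space")
    try:
        # Simulate the competing process by creating the folder in the gap.
        racy_makedirs(target, between_check_and_create=lambda: os.mkdir(target))
    except FileExistsError:
        outcome = "FileExistsError"
    else:
        outcome = "ok"

print(outcome)
```

In a real run the gap between the check and the creation is tiny, which is consistent with the failure only showing up occasionally and mostly with many workers starting at once.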

What you expected to happen: I would expect the code to simply continue if the folder already exists, for example by calling os.makedirs(folder, exist_ok=True).
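A minimal sketch of the suggested behavior follows. It uses threads rather than dask worker processes to keep the example small, and ensure_dir is a hypothetical helper name, not part of distributed. With exist_ok=True, every caller succeeds no matter which one creates the folder first.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def ensure_dir(path):
    """Create path if needed; tolerate another caller creating it first."""
    os.makedirs(path, exist_ok=True)
    return os.path.isdir(path)


with tempfile.TemporaryDirectory() as base:
    target = os.path.join(base, "dask-worker-space")
    # 16 concurrent callers, mirroring the worker count mentioned in the report.
    with ThreadPoolExecutor(max_workers=16) as pool:
        results = list(pool.map(ensure_dir, [target] * 64))

print(all(results))
```

Note that exist_ok=True only suppresses the error when the existing path is a directory; if a regular file with that name exists, os.makedirs still raises FileExistsError.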

Minimal Complete Verifiable Example: The following triggers the issue intermittently.

from dask.distributed import Client
client = Client(memory_limit=0, n_workers=Environment.DASK_PROCESSES)

Anything else we need to know?: I am happy to submit a PR to fix this (and yes it would be tiny 😉 ).

Environment:

  • Dask version: 2.19.0
  • Python version: 3.8.2
  • Operating System: docker python:3.8 image
  • Install method (conda, pip, source): pip (pipenv)

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 2
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

jendrikjoe commented, Jul 21, 2020 (1 reaction)

That is good to be closed 👍 Thanks for merging the PR 🙂

jakirkham commented, Jul 21, 2020 (0 reactions)

Thanks for the PR 😄

Top Results From Across the Web

python - os.makedirs() Occasionally Fails - Stack Overflow
If the directory can not be found the program attempts to create it: if not os.path.exists(path): os.makedirs(path).

Python | os.makedirs() method - GeeksforGeeks
makedirs () method will create all unavailable/missing directory in the specified path. 'GeeksForGeeks' and 'Authors' will be created first then ...

How can I safely create a nested directory | Edureka Community
makedirs calls, the os.makedirs will fail with an OSError. Unfortunately, blanket-catching OSError and continuing is not foolproof, as it will ...

Randomly at the end of a program I get an error that ... - GitHub
I am running the script in a job, so when it exits with an error, the job also fails. I use initialize function...

Correct way to create a directory in Python - Deepak Nagaraj
Python documentation mentions that os.makedirs() can fail if the leaf directory exists: Raises an error exception if the leaf directory already ...