
FileExistsError: File exists: 'storage'

See original GitHub issue

I have been seeing this error quite often, but it seems to be transient in nature – not all nodes experience it.

I’m running on Azure ML Compute: 20 nodes, each with 4 V100s (16 GB). Some of the nodes report this:

Starting the daemon thread to refresh tokens in background for process with pid = 224
Entering Run History Context Manager.
Preparing to call script [ start_worker.py ] with arguments: ['--scheduler_ip_port=172.18.0.4:8786', '--use_gpu=True', '--n_gpus_per_node=4']
After variable expansion, calling script [ start_worker.py ] with arguments: ['--scheduler_ip_port=172.18.0.4:8786', '--use_gpu=True', '--n_gpus_per_node=4']

- scheduler is  172.18.0.4:8786
- args:  Namespace(n_gpus_per_node='4', scheduler_ip_port='172.18.0.4:8786', use_gpu='True')
- unparsed:  []
- my rank is  0
- my ip is:  172.18.0.21
- n_gpus_per_node:  4
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.18.0.21:34865'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.18.0.21:37105'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.18.0.21:46857'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.18.0.21:38133'
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Process Dask Worker process (from Nanny):
Process Dask Worker process (from Nanny):
Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/opt/conda/envs/rapids/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/process.py", line 191, in _run
    target(*args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/nanny.py", line 666, in _run
    worker = Worker(**worker_kwargs)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/worker.py", line 543, in __init__
    self.data = data[0](**data[1])
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/dask_cuda/device_host_file.py", line 141, in __init__
    self.disk_func = Func(serialize_bytelist, deserialize_bytes, File(path))
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/zict/file.py", line 63, in __init__
    os.mkdir(self.directory)
FileExistsError: [Errno 17] File exists: 'storage'
Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/opt/conda/envs/rapids/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/process.py", line 191, in _run
    target(*args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/nanny.py", line 666, in _run
    worker = Worker(**worker_kwargs)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/worker.py", line 543, in __init__
    self.data = data[0](**data[1])
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/dask_cuda/device_host_file.py", line 141, in __init__
    self.disk_func = Func(serialize_bytelist, deserialize_bytes, File(path))
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/zict/file.py", line 63, in __init__
    os.mkdir(self.directory)
FileExistsError: [Errno 17] File exists: 'storage'
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.nanny - INFO - Worker process 248 exited with status 1
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7fb1ebb6c490>>, <Task finished coro=<Nanny._on_exit() done, defined at /opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/nanny.py:387> exception=TypeError('addresses should be strings or tuples, got None')>)
Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
    future.result()
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/nanny.py", line 390, in _on_exit
    await self.scheduler.unregister(address=self.worker_address)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/core.py", line 757, in send_recv_from_rpc
    result = await send_recv(comm=comm, op=key, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/core.py", line 556, in send_recv
    raise exc.with_traceback(tb)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/core.py", line 408, in handle_comm
    result = handler(comm, **msg)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/scheduler.py", line 2135, in remove_worker
    address = self.coerce_address(address)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/scheduler.py", line 4844, in coerce_address
    raise TypeError("addresses should be strings or tuples, got %r" % (addr,))
TypeError: addresses should be strings or tuples, got None
distributed.nanny - INFO - Closing Nanny at 'tcp://172.18.0.21:46857'
distributed.worker - INFO -       Start worker at:    tcp://172.18.0.21:43685
distributed.worker - INFO -          Listening to:    tcp://172.18.0.21:43685
distributed.worker - INFO -          dashboard at:          172.18.0.21:34483
distributed.worker - INFO - Waiting to connect to:      tcp://172.18.0.4:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -       Local Directory: /mnt/batch/tasks/shared/LS_root/jobs/todrabas_testing/azureml/todrabas-dask-benchmarks_1582564385_9b31e947/mounts/workspaceblobstore/azureml/todrabas-dask-benchmarks_1582564385_9b31e947/worker-sfamnyrv
distributed.worker - INFO - Starting Worker plugin <dask_cuda.utils.RMMPool object at 0x7f9a30f10710>-52a534a5-1f63-44f8-883b-29f22e1cc9ec
distributed.worker - INFO - Starting Worker plugin <dask_cuda.utils.CPUAffinity object at 0x7f9a30f89-8abaf39c-62f2-4dc6-94ed-a23d632db081
distributed.worker - INFO - -------------------------------------------------
distributed.nanny - INFO - Worker process 251 exited with status 1
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7fb1ebb6c490>>, <Task finished coro=<Nanny._on_exit() done, defined at /opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/nanny.py:387> exception=TypeError('addresses should be strings or tuples, got None')>)
Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
    future.result()
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/nanny.py", line 390, in _on_exit
    await self.scheduler.unregister(address=self.worker_address)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/core.py", line 757, in send_recv_from_rpc
    result = await send_recv(comm=comm, op=key, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/core.py", line 556, in send_recv
    raise exc.with_traceback(tb)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/core.py", line 408, in handle_comm
    result = handler(comm, **msg)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/scheduler.py", line 2135, in remove_worker
    address = self.coerce_address(address)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/scheduler.py", line 4844, in coerce_address
    raise TypeError("addresses should be strings or tuples, got %r" % (addr,))
TypeError: addresses should be strings or tuples, got None
distributed.nanny - INFO - Closing Nanny at 'tcp://172.18.0.21:38133'
distributed.worker - INFO -         Registered to:      tcp://172.18.0.4:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO -       Start worker at:    tcp://172.18.0.21:34807
distributed.worker - INFO -          Listening to:    tcp://172.18.0.21:34807
distributed.worker - INFO -          dashboard at:          172.18.0.21:40565
distributed.worker - INFO - Waiting to connect to:      tcp://172.18.0.4:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -       Local Directory: /mnt/batch/tasks/shared/LS_root/jobs/todrabas_testing/azureml/todrabas-dask-benchmarks_1582564385_9b31e947/mounts/workspaceblobstore/azureml/todrabas-dask-benchmarks_1582564385_9b31e947/worker-9q2p2egq
distributed.worker - INFO - Starting Worker plugin <dask_cuda.utils.RMMPool object at 0x7f611473a410>-21b3a3b2-d964-42eb-a350-39ba21155f8b
distributed.worker - INFO - Starting Worker plugin <dask_cuda.utils.CPUAffinity object at 0x7f6114736-5dd82753-e3d4-4fc1-8d07-3c55f3b926df
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:      tcp://172.18.0.4:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection

As a result, these nodes register only 2 workers instead of 4. I’m not sure how to mitigate this… Any suggestions welcome.
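
The traceback points at zict's File store: each worker process calls os.mkdir('storage') while setting up its spill-to-disk directory, and when several workers on the same node end up sharing that path, every call after the first one raises FileExistsError. The snippet below is only a minimal sketch of that failure mode; it assumes nothing about dask_cuda's actual startup path beyond what the traceback shows, and the function name is purely illustrative.

# Minimal reproduction sketch: several processes create the same
# 'storage' directory, and only the first os.mkdir succeeds.
import multiprocessing
import os
import tempfile


def create_spill_dir(path):
    try:
        os.mkdir(path)  # the call that fails in zict/file.py above
        return "created"
    except FileExistsError:
        return "lost the race"


if __name__ == "__main__":
    base = tempfile.mkdtemp()
    storage = os.path.join(base, "storage")
    with multiprocessing.Pool(processes=4) as pool:
        # Expect one "created" and three "lost the race".
        print(pool.map(create_spill_dir, [storage] * 4))

In the log above the exception is not caught, so the affected worker processes exit with status 1 and never register with the scheduler.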

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 13 (5 by maintainers)

Top GitHub Comments

3 reactions
pentschev commented, Feb 24, 2020

Glad it worked @drabastomek !

2 reactions
drabastomek commented, Feb 24, 2020

Getting zict from GitHub solved the issue! Woot! Thanks for such a quick turnaround! Case closed!
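
For reference, "getting zict from GitHub" typically means installing the development version with something like pip install git+https://github.com/dask/zict.git (the exact branch or commit that carried the fix is not recorded in this thread). The change that matters is making the spill-directory creation tolerant of a directory that already exists; the sketch below shows that pattern in general terms, with an illustrative helper name, not zict's actual source:

# Race-tolerant directory creation: succeeds whether or not another
# worker process created the path first.
import os


def ensure_spill_dir(path):
    # Unlike os.mkdir, makedirs with exist_ok=True does not raise when
    # the directory already exists; it still raises FileExistsError if
    # the path exists but is a regular file.
    os.makedirs(path, exist_ok=True)
    return path


ensure_spill_dir("storage")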


Top Results From Across the Web

Python "FileExists" error when making directory - Stack Overflow
Just checking if the directory already exist throws this error message [Errno 17] File exists because we are just checking if the...

FileExistsError: [Errno 17] File exists: '/storage/data/oxford-iiit ...
Hi everyone, I've just started the book and I'm trying to get the first piece of code of chapter 1 to run in...

'File exists' error received instead of 'permission denied' when ...
Issue. copying a file on a GFS2 file system gives an error message 'File exists' instead of 'Permission denied' message in case if...

After manually delete a file, and use FileOutputStream to ...
After manually delete a file, and use FileOutputStream to create same name file, it will fail with "file exists" error?

What is a "failed to create a symbolic link: file exists" error?
This is a classical error... it's the other way around: ln -s Existing-file New-name. so in your case
