FileExistsError: File exists: 'storage'
See original GitHub issue
I have been seeing these quite often, but the issue seems to be transient in nature – not all nodes experience it.
I'm running on Azure ML Compute: 20 nodes, each with 4 V100s (16GB). Some of the nodes report this:
Starting the daemon thread to refresh tokens in background for process with pid = 224
Entering Run History Context Manager.
Preparing to call script [ start_worker.py ] with arguments: ['--scheduler_ip_port=172.18.0.4:8786', '--use_gpu=True', '--n_gpus_per_node=4']
After variable expansion, calling script [ start_worker.py ] with arguments: ['--scheduler_ip_port=172.18.0.4:8786', '--use_gpu=True', '--n_gpus_per_node=4']
- scheduler is 172.18.0.4:8786
- args: Namespace(n_gpus_per_node='4', scheduler_ip_port='172.18.0.4:8786', use_gpu='True')
- unparsed: []
- my rank is 0
- my ip is: 172.18.0.21
- n_gpus_per_node: 4
distributed.nanny - INFO - Start Nanny at: 'tcp://172.18.0.21:34865'
distributed.nanny - INFO - Start Nanny at: 'tcp://172.18.0.21:37105'
distributed.nanny - INFO - Start Nanny at: 'tcp://172.18.0.21:46857'
distributed.nanny - INFO - Start Nanny at: 'tcp://172.18.0.21:38133'
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Process Dask Worker process (from Nanny):
Process Dask Worker process (from Nanny):
Traceback (most recent call last):
File "/opt/conda/envs/rapids/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/opt/conda/envs/rapids/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/process.py", line 191, in _run
target(*args, **kwargs)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/nanny.py", line 666, in _run
worker = Worker(**worker_kwargs)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/worker.py", line 543, in __init__
self.data = data[0](**data[1])
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/dask_cuda/device_host_file.py", line 141, in __init__
self.disk_func = Func(serialize_bytelist, deserialize_bytes, File(path))
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/zict/file.py", line 63, in __init__
os.mkdir(self.directory)
FileExistsError: [Errno 17] File exists: 'storage'
Traceback (most recent call last):
File "/opt/conda/envs/rapids/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/opt/conda/envs/rapids/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/process.py", line 191, in _run
target(*args, **kwargs)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/nanny.py", line 666, in _run
worker = Worker(**worker_kwargs)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/worker.py", line 543, in __init__
self.data = data[0](**data[1])
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/dask_cuda/device_host_file.py", line 141, in __init__
self.disk_func = Func(serialize_bytelist, deserialize_bytes, File(path))
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/zict/file.py", line 63, in __init__
os.mkdir(self.directory)
FileExistsError: [Errno 17] File exists: 'storage'
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.nanny - INFO - Worker process 248 exited with status 1
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7fb1ebb6c490>>, <Task finished coro=<Nanny._on_exit() done, defined at /opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/nanny.py:387> exception=TypeError('addresses should be strings or tuples, got None')>)
Traceback (most recent call last):
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
ret = callback()
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
future.result()
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/nanny.py", line 390, in _on_exit
await self.scheduler.unregister(address=self.worker_address)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/core.py", line 757, in send_recv_from_rpc
result = await send_recv(comm=comm, op=key, **kwargs)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/core.py", line 556, in send_recv
raise exc.with_traceback(tb)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/core.py", line 408, in handle_comm
result = handler(comm, **msg)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/scheduler.py", line 2135, in remove_worker
address = self.coerce_address(address)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/scheduler.py", line 4844, in coerce_address
raise TypeError("addresses should be strings or tuples, got %r" % (addr,))
TypeError: addresses should be strings or tuples, got None
distributed.nanny - INFO - Closing Nanny at 'tcp://172.18.0.21:46857'
distributed.worker - INFO - Start worker at: tcp://172.18.0.21:43685
distributed.worker - INFO - Listening to: tcp://172.18.0.21:43685
distributed.worker - INFO - dashboard at: 172.18.0.21:34483
distributed.worker - INFO - Waiting to connect to: tcp://172.18.0.4:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 1
distributed.worker - INFO - Local Directory: /mnt/batch/tasks/shared/LS_root/jobs/todrabas_testing/azureml/todrabas-dask-benchmarks_1582564385_9b31e947/mounts/workspaceblobstore/azureml/todrabas-dask-benchmarks_1582564385_9b31e947/worker-sfamnyrv
distributed.worker - INFO - Starting Worker plugin <dask_cuda.utils.RMMPool object at 0x7f9a30f10710>-52a534a5-1f63-44f8-883b-29f22e1cc9ec
distributed.worker - INFO - Starting Worker plugin <dask_cuda.utils.CPUAffinity object at 0x7f9a30f89-8abaf39c-62f2-4dc6-94ed-a23d632db081
distributed.worker - INFO - -------------------------------------------------
distributed.nanny - INFO - Worker process 251 exited with status 1
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7fb1ebb6c490>>, <Task finished coro=<Nanny._on_exit() done, defined at /opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/nanny.py:387> exception=TypeError('addresses should be strings or tuples, got None')>)
Traceback (most recent call last):
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
ret = callback()
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
future.result()
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/nanny.py", line 390, in _on_exit
await self.scheduler.unregister(address=self.worker_address)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/core.py", line 757, in send_recv_from_rpc
result = await send_recv(comm=comm, op=key, **kwargs)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/core.py", line 556, in send_recv
raise exc.with_traceback(tb)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/core.py", line 408, in handle_comm
result = handler(comm, **msg)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/scheduler.py", line 2135, in remove_worker
address = self.coerce_address(address)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/scheduler.py", line 4844, in coerce_address
raise TypeError("addresses should be strings or tuples, got %r" % (addr,))
TypeError: addresses should be strings or tuples, got None
distributed.nanny - INFO - Closing Nanny at 'tcp://172.18.0.21:38133'
distributed.worker - INFO - Registered to: tcp://172.18.0.4:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Start worker at: tcp://172.18.0.21:34807
distributed.worker - INFO - Listening to: tcp://172.18.0.21:34807
distributed.worker - INFO - dashboard at: 172.18.0.21:40565
distributed.worker - INFO - Waiting to connect to: tcp://172.18.0.4:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 1
distributed.worker - INFO - Local Directory: /mnt/batch/tasks/shared/LS_root/jobs/todrabas_testing/azureml/todrabas-dask-benchmarks_1582564385_9b31e947/mounts/workspaceblobstore/azureml/todrabas-dask-benchmarks_1582564385_9b31e947/worker-9q2p2egq
distributed.worker - INFO - Starting Worker plugin <dask_cuda.utils.RMMPool object at 0x7f611473a410>-21b3a3b2-d964-42eb-a350-39ba21155f8b
distributed.worker - INFO - Starting Worker plugin <dask_cuda.utils.CPUAffinity object at 0x7f6114736-5dd82753-e3d4-4fc1-8d07-3c55f3b926df
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://172.18.0.4:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
These nodes register only 2 workers instead of 4. I'm not sure how to mitigate this… any suggestions welcome.
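For context on the traceback above: zict's File store creates its spill-to-disk directory with a bare os.mkdir, so when several worker processes on one node (here, four nannies sharing the same working directory on the blob-store mount) race to create the same 'storage' path, only the first succeeds and the rest die with FileExistsError. The later TypeError ("addresses should be strings or tuples, got None") is just fallout: the crashed workers never acquired an address, so the nanny tries to unregister None from the scheduler. A minimal sketch of the race, assuming a shared 'storage' path (the names and worker count below are illustrative, not taken from the job config):

import multiprocessing as mp
import os

STORAGE_DIR = "storage"  # hypothetical shared spill directory, as in the zict traceback

def start_worker(rank: int) -> None:
    # zict.File.__init__ effectively does a bare os.mkdir(directory);
    # in the real workers this is not caught, so the losing processes exit.
    try:
        os.mkdir(STORAGE_DIR)
        print(f"worker {rank}: created {STORAGE_DIR!r}")
    except FileExistsError:
        print(f"worker {rank}: FileExistsError: [Errno 17] File exists: {STORAGE_DIR!r}")

if __name__ == "__main__":
    procs = [mp.Process(target=start_worker, args=(i,)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    os.rmdir(STORAGE_DIR)  # remove the directory the winning process created

Run on one machine, typically a single process reports success and the others hit FileExistsError, which is consistent with only some of the four workers on a node managing to register.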
Top GitHub Comments
Glad it worked, @drabastomek!
Getting zict from GitHub solved the issue! Woot! Thanks for such a quick turnaround! Case closed!
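The fix that resolves this class of error is to make the directory creation idempotent, i.e. tolerate an already-existing directory instead of failing. A hedged sketch of that pattern (not necessarily the exact zict patch):

import os

def ensure_spill_directory(directory: str) -> None:
    # Safe when several worker processes race to create the same
    # spill directory: exist_ok=True turns the collision into a no-op.
    os.makedirs(directory, exist_ok=True)

ensure_spill_directory("storage")

Until a fixed zict release is available, a common workaround is to point each worker at a node-local scratch path (for example via the worker's local directory setting) so the spill directory does not live on a shared mount.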