HPC cluster workers preload file not found
I’m launching multiple (in this case 40, but the exact number doesn’t matter) Dask workers at the same time on an HPC cluster using a job scheduler. They start running almost simultaneously and all preload the same dask_preload.py file. Roughly half of the workers die immediately with the following error; the others start up correctly:
Traceback (most recent call last):
  File "/python_virtualenv/bin/dask-worker", line 11, in <module>
    sys.exit(go())
  File "/python_virtualenv/lib/python2.7/site-packages/distributed/cli/dask_worker.py", line 252, in go
    main()
  File "/python_virtualenv/lib/python2.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/python_virtualenv/lib/python2.7/site-packages/click/core.py", line 696, in main
    with self.make_context(prog_name, args, **extra) as ctx:
  File "/python_virtualenv/lib/python2.7/site-packages/click/core.py", line 621, in make_context
    self.parse_args(ctx, args)
  File "/python_virtualenv/lib/python2.7/site-packages/click/core.py", line 880, in parse_args
    value, args = param.handle_parse_result(ctx, opts, args)
  File "/python_virtualenv/lib/python2.7/site-packages/click/core.py", line 1404, in handle_parse_result
    self.callback, ctx, self, value)
  File "/python_virtualenv/lib/python2.7/site-packages/click/core.py", line 78, in invoke_param_callback
    return callback(ctx, param, value)
  File "/python_virtualenv/lib/python2.7/site-packages/distributed/preloading.py", line 32, in validate_preload_argv
    preload_modules = _import_modules(ctx.params.get("preload"))
  File "/python_virtualenv/lib/python2.7/site-packages/distributed/preloading.py", line 92, in _import_modules
    module = import_file(name)[0]
  File "/python_virtualenv/lib/python2.7/site-packages/distributed/utils.py", line 1003, in import_file
    os.remove(cache_file)
OSError: [Errno 2] No such file or directory: 'dask_preload.pyc'
I can schedule the workers to start one at a time, but first I want to know whether it is possible to fix this within the scheduler code itself.
Btw. if I start a single worker, it loads the dask_preload.py file without problems.
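For context, the failure mode in the traceback can be reproduced without Dask at all: when several processes race to delete the same bytecode cache file, only one `os.remove` succeeds and the losers get `OSError` with errno 2 (ENOENT). A minimal sketch (file name is illustrative, simulated sequentially rather than with real concurrent workers):

```python
import errno
import os

# Simulate two workers racing to delete the same preload cache file.
path = "dask_preload_demo.pyc"
open(path, "w").close()

os.remove(path)        # worker A wins the race
try:
    os.remove(path)    # worker B loses: the file is already gone
except OSError as e:
    # This is exactly the error from the traceback above.
    assert e.errno == errno.ENOENT
```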
Issue Analytics
- Created: 5 years ago
- Comments: 11 (9 by maintainers)
Top GitHub Comments
Yes, what @pitrou proposes is an easy fix. I have a fix locally that I’ll push up in a bit.
This looks like a benign race condition in https://github.com/dask/distributed/blob/master/distributed/utils.py#L1000-L1003.
Just replace:
with:
(you can refine the exception check by querying errno)
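The code snippets in the comment above were lost in the scrape. Based on the traceback, the suggested fix presumably replaces the unconditional `os.remove(cache_file)` with a removal that tolerates the file already being gone. A hedged sketch of that idea, with the errno refinement the comment mentions (the helper name `remove_if_exists` is mine, not from the actual patch):

```python
import errno
import os

def remove_if_exists(path):
    """Remove path, ignoring the case where another process got there first."""
    try:
        os.remove(path)
    except OSError as e:
        # Re-raise anything other than "no such file or directory".
        if e.errno != errno.ENOENT:
            raise

# Safe even when two workers race on the same cache file:
open("demo_cache.pyc", "w").close()
remove_if_exists("demo_cache.pyc")
remove_if_exists("demo_cache.pyc")  # second call is a harmless no-op
```

Checking `e.errno` (rather than swallowing every `OSError`) preserves real failures such as permission errors.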