
HPC cluster workers preload file not found


I’m launching multiple Dask workers (40 in this case, though the number doesn’t matter) at the same time on an HPC cluster using a job scheduler. They start almost simultaneously and all preload the same dask_preload.py file. Roughly half of the workers die immediately with the following error; the others start up correctly:

Traceback (most recent call last):
  File "/python_virtualenv/bin/dask-worker", line 11, in <module>
    sys.exit(go())
  File "/python_virtualenv/lib/python2.7/site-packages/distributed/cli/dask_worker.py", line 252, in go
    main()
  File "/python_virtualenv/lib/python2.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/python_virtualenv/lib/python2.7/site-packages/click/core.py", line 696, in main
    with self.make_context(prog_name, args, **extra) as ctx:
  File "/python_virtualenv/lib/python2.7/site-packages/click/core.py", line 621, in make_context
    self.parse_args(ctx, args)
  File "/python_virtualenv/lib/python2.7/site-packages/click/core.py", line 880, in parse_args
    value, args = param.handle_parse_result(ctx, opts, args)
  File "/python_virtualenv/lib/python2.7/site-packages/click/core.py", line 1404, in handle_parse_result
    self.callback, ctx, self, value)
  File "/python_virtualenv/lib/python2.7/site-packages/click/core.py", line 78, in invoke_param_callback
    return callback(ctx, param, value)
  File "/python_virtualenv/lib/python2.7/site-packages/distributed/preloading.py", line 32, in validate_preload_argv
    preload_modules = _import_modules(ctx.params.get("preload"))
  File "/python_virtualenv/lib/python2.7/site-packages/distributed/preloading.py", line 92, in _import_modules
    module = import_file(name)[0]
  File "/python_virtualenv/lib/python2.7/site-packages/distributed/utils.py", line 1003, in import_file
    os.remove(cache_file)
OSError: [Errno 2] No such file or directory: 'dask_preload.pyc'

I can schedule the workers to start one at a time, but first I’d like to know whether this can be fixed within the scheduler code itself.

By the way, if I start a single worker, it loads the dask_preload.py file without problems.

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 11 (9 by maintainers)

Top GitHub Comments

1 reaction
mrocklin commented, Apr 17, 2018

Yes, what @pitrou proposes is an easy fix. I have a fix locally that I’ll push up in a bit.

On Tue, Apr 17, 2018 at 2:08 AM, Wouter van Roosmalen <notifications@github.com> wrote:

Is this something to change in a next release? Or a custom change specific to this use case?


1 reaction
pitrou commented, Apr 16, 2018

This looks like a benign case of race condition in https://github.com/dask/distributed/blob/master/distributed/utils.py#L1000-L1003.

Just replace:

if os.path.exists(cache_file):
    os.remove(cache_file)

with:

try:
    os.remove(cache_file)
except OSError:
    pass  # ignore "file not found"

(You can refine the exception check by querying errno.)
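The errno refinement pitrou mentions might look like the sketch below. The helper name `remove_cache_file` is hypothetical, not distributed’s actual API; it swallows only “No such file or directory” and re-raises anything else (e.g. a permission error):

```python
import errno
import os
import tempfile

def remove_cache_file(cache_file):
    """Remove a bytecode cache file, tolerating a concurrent removal."""
    try:
        os.remove(cache_file)
    except OSError as e:
        # Another worker may have removed the file between our check
        # and the remove; only ENOENT ("No such file or directory")
        # is benign here.
        if e.errno != errno.ENOENT:
            raise

path = os.path.join(tempfile.mkdtemp(), "dask_preload.pyc")
open(path, "w").close()
remove_cache_file(path)   # removes the existing file
remove_cache_file(path)   # file already gone: silently ignored
```

On Python 3 one could catch the narrower `FileNotFoundError` directly, but the errno check also works on the Python 2.7 shown in the traceback.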


