`.repo_lock.softlock` could not be acquired when multiple slurm jobs are started

🐛 Bug

When a job is submitted with sbatch --ntasks n and n is larger than 1, multiple parallel tasks race to create a run at the same time. Sometimes the race causes the error ".repo_lock.softlock could not be acquired".

To reproduce

Submit a Python script with the Slurm command sbatch --ntasks 2. Creating a run with run = Run() may then fail with the following error:

  File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/run.py", line 287, in __init__
    super().__init__(run_hash, repo=repo, read_only=read_only)
  File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/base_run.py", line 31, in __init__
    self.repo = get_repo(repo)
  File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/repo_utils.py", line 25, in get_repo
    repo = Repo.from_path(repo)
  File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/repo.py", line 209, in from_path
    repo = Repo(path, read_only=read_only, init=init)
  File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/repo.py", line 138, in __init__
    with self.lock():
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.10.2/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/repo.py", line 165, in lock
    self._lock.acquire()
  File "/home/twni2016/env/lib/python3.10/site-packages/filelock/_api.py", line 183, in acquire
    raise Timeout(self._lock_file)
filelock._error.Timeout: The file lock '*/.aim/.repo_lock.softlock' could not be acquired.
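
For readers unfamiliar with this kind of error, here is a minimal sketch of how an existence-based soft lock can time out under contention. It is an illustration only, not Aim's or filelock's actual code: it assumes the lock is held by exclusively creating a file and released by deleting it. The names `LockTimeout`, `acquire_soft_lock`, `release_soft_lock` and `demo` are hypothetical.

```python
import os
import tempfile
import time


class LockTimeout(Exception):
    """Raised when the lock file still exists after the timeout (hypothetical name)."""


def acquire_soft_lock(path, timeout=1.0, poll=0.05):
    """Create `path` exclusively, polling until `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if another holder already created the file.
            return os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise LockTimeout(path)
            time.sleep(poll)


def release_soft_lock(fd, path):
    """Release the lock by closing and deleting the lock file."""
    os.close(fd)
    os.remove(path)


def demo():
    """Return True if a second acquisition times out while the first is held."""
    with tempfile.TemporaryDirectory() as tmp:
        lock_path = os.path.join(tmp, ".repo_lock.softlock")
        first = acquire_soft_lock(lock_path)
        try:
            acquire_soft_lock(lock_path, timeout=0.2)
            second_failed = False
        except LockTimeout:
            second_failed = True
        release_soft_lock(first, lock_path)
        # Once released, the lock can be taken again.
        third = acquire_soft_lock(lock_path, timeout=0.2)
        release_soft_lock(third, lock_path)
        return second_failed
```

Under this assumption, a task that holds the lock for longer than another task's timeout, or a lock file left behind by a killed job, would produce a timeout like the one in the traceback.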

Expected behavior

No such error.

Environment

  • Aim Version 3.14.1
  • Python version 3.10
  • pip version 22.2.2
  • OS Linux

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
gorarakelyan commented, Oct 24, 2022

@twni2016 ah, my bad, for a moment I thought ntasks executes parallel threads which try to write to the same run.

Most probably this is purely related to an issue with the run locking mechanism; a fix is in progress. A similar issue was reported last week as well. The plan is to release a patch in the coming days. I will let you know once the fix is shipped.
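
Until a patched release is available, one possible stopgap is to retry run creation when the lock times out. The sketch below is an untested workaround under stated assumptions, not the maintainers' fix: `create_with_retry` is a hypothetical helper, and the retry count and delays are arbitrary. It takes the factory and the exception type as arguments so it does not depend on Aim itself.

```python
import random
import time


def create_with_retry(factory, retry_on=(Exception,), attempts=5,
                      base_delay=1.0, sleep=time.sleep):
    """Call `factory()` until it succeeds, backing off between attempts.

    Delays grow exponentially and include random jitter so that parallel
    tasks do not all retry at the same moment.
    """
    for attempt in range(attempts):
        try:
            return factory()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))


# Possible usage with Aim (assumes `aim` and `filelock` are installed):
#
#   from aim import Run
#   from filelock import Timeout
#   run = create_with_retry(Run, retry_on=(Timeout,))
```

The jitter matters here because the tasks start together; without it they would likely collide again on every retry.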

1 reaction
gorarakelyan commented, Oct 16, 2022

@twni2016 Aim has a natural limitation: it does not allow writing to the same run from multiple parallel clients. It seems the ntasks argument runs parallel threads, which is causing the issue. Is there a way in Slurm to configure the workflow to initialize the aim.Run only once and then use it as a shared resource between the threads?
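
One way to approximate "initialize the run only once" is to create it only in the first Slurm task. The sketch below assumes the tasks are launched with srun, which sets the SLURM_PROCID environment variable to each task's rank; `is_primary_task` and `create_run_on_primary` are hypothetical helper names. It does not share the run between tasks, and it would not help if each task legitimately needs its own run.

```python
import os


def is_primary_task(env=None):
    """True for Slurm task 0, or when SLURM_PROCID is not set at all."""
    env = os.environ if env is None else env
    return int(env.get("SLURM_PROCID", "0")) == 0


def create_run_on_primary(run_factory, env=None):
    """Call `run_factory()` on the primary task only; return None elsewhere."""
    if is_primary_task(env):
        return run_factory()
    return None


# Possible usage with Aim (assumes `aim` is installed):
#
#   from aim import Run
#   run = create_run_on_primary(Run)
#   if run is not None:
#       run["hparams"] = {"lr": 0.001}
```

Non-primary tasks receive None, so every tracking call has to be guarded, as in the usage comment.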

@alberttorosyan tagging you so you are aware of this thread.

