`.repo_lock.softlock` could not be acquired when multiple slurm jobs are started
🐛 Bug
When running sbatch --ntasks n with n larger than 1, there is a race between the multiple Slurm tasks, each of which wants to create a run. Sometimes the race causes a ".repo_lock.softlock could not be acquired" error.
To reproduce
Run the Slurm command sbatch --ntasks 2 for a Python script; the racing tasks can then fail when creating a run via run = Run():
File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/run.py", line 287, in __init__
super().__init__(run_hash, repo=repo, read_only=read_only)
File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/base_run.py", line 31, in __init__
super().__init__(run_hash, repo=repo, read_only=read_only)
File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/base_run.py", line 31, in __init__
self.repo = get_repo(repo)
File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/repo_utils.py", line 25, in get_repo
self.repo = get_repo(repo)
File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/repo_utils.py", line 25, in get_repo
repo = Repo.from_path(repo)
File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/repo.py", line 209, in from_path
repo = Repo.from_path(repo)
File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/repo.py", line 209, in from_path
repo = Repo(path, read_only=read_only, init=init)
File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/repo.py", line 138, in __init__
with self.lock():
File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.10.2/lib/python3.10/contextlib.py", line 135, in __enter__
repo = Repo(path, read_only=read_only, init=init)
File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/repo.py", line 138, in __init__
return next(self.gen)
File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/repo.py", line 165, in lock
with self.lock():
File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.10.2/lib/python3.10/contextlib.py", line 135, in __enter__
return next(self.gen)
File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/repo.py", line 165, in lock
self._lock.acquire()
File "/home/twni2016/env/lib/python3.10/site-packages/filelock/_api.py", line 183, in acquire
self._lock.acquire()
File "/home/twni2016/env/lib/python3.10/site-packages/filelock/_api.py", line 183, in acquire
raise Timeout(self._lock_file)
raise Timeout(self._lock_file)
filelock._error.Timeout: The file lock '*/.aim/.repo_lock.softlock' could not be acquired.
filelock._error.Timeout: The file lock '*/.aim/.repo_lock.softlock' could not be acquired.
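For concreteness, here is a minimal reproduction sketch; the script name, repo path, and tracked metric are illustrative, not from the original report:

# repro.py -- each Slurm task executes this script concurrently.
# Submit with: sbatch --ntasks 2 --wrap "srun python repro.py"
from aim import Run

# Every task opens the same .aim repository, so all of them race to take
# .aim/.repo_lock.softlock during Repo initialization; the loser can hit
# filelock's acquire timeout and raise the error above.
run = Run(repo='.')
run.track(1.0, name='loss', step=0)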
Expected behavior
No such error.
Environment
- Aim version: 3.14.1
- Python version: 3.10
- pip version: 22.2.2
- OS: Linux
Top GitHub Comments
@twni2016 ah, my bad, for a moment I thought ntasks executes parallel threads which try to write to the same run.
Most probably this is purely related to the issue with the runs locking mechanism; the fix is in progress now. A similar issue was reported last week as well. The plan is to release a patch fix in the coming days. I will let you know once the fix is shipped.
@twni2016 Aim has a natural limitation which restricts writing to the same run from multiple parallel clients. It seems the ntasks argument runs parallel threads, which is causing the issue. Is there a way in Slurm to configure the workflow to initialize the aim.Run only once and then use it as a shared resource between the threads?
@alberttorosyan tagging you so you are aware of this thread.
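A minimal sketch of that suggestion, assuming Slurm's standard SLURM_PROCID environment variable (the task's rank within the job step); the rank-0 guard is an illustrative workaround, not an official Aim API:

import os

from aim import Run

# SLURM_PROCID is set by Slurm to this task's rank within the job step.
rank = int(os.environ.get('SLURM_PROCID', '0'))

run = None
if rank == 0:
    # Only task 0 touches the repo, so .repo_lock.softlock is never contended.
    run = Run(repo='.')

if run is not None:
    run.track(0.5, name='loss', step=0)
# Tasks with rank > 0 would need to hand their metrics to task 0
# (e.g. via MPI or shared files) if they also have data to log.

Alternatively, staggering task start-up (e.g. sleeping a rank-proportional number of seconds before calling Run()) reduces lock contention, but it only makes the race less likely rather than removing it.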