`.repo_lock.softlock` could not be acquired when multiple slurm jobs are started
🐛 Bug
When running sbatch --ntasks n with n larger than 1, there is a race between the multiple Slurm tasks, each of which wants to create a run. Sometimes the race causes a ".repo_lock.softlock could not be acquired" error.
To reproduce
Run the Slurm command sbatch --ntasks 2 for a Python script; the racing tasks can then fail when creating a run via run = Run():
File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/run.py", line 287, in __init__
super().__init__(run_hash, repo=repo, read_only=read_only)
File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/base_run.py", line 31, in __init__
super().__init__(run_hash, repo=repo, read_only=read_only)
File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/base_run.py", line 31, in __init__
self.repo = get_repo(repo)
File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/repo_utils.py", line 25, in get_repo
self.repo = get_repo(repo)
File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/repo_utils.py", line 25, in get_repo
repo = Repo.from_path(repo)
File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/repo.py", line 209, in from_path
repo = Repo.from_path(repo)
File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/repo.py", line 209, in from_path
repo = Repo(path, read_only=read_only, init=init)
File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/repo.py", line 138, in __init__
with self.lock():
File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.10.2/lib/python3.10/contextlib.py", line 135, in __enter__
repo = Repo(path, read_only=read_only, init=init)
File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/repo.py", line 138, in __init__
return next(self.gen)
File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/repo.py", line 165, in lock
with self.lock():
File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.10.2/lib/python3.10/contextlib.py", line 135, in __enter__
return next(self.gen)
File "/home/twni2016/env/lib/python3.10/site-packages/aim/sdk/repo.py", line 165, in lock
self._lock.acquire()
File "/home/twni2016/env/lib/python3.10/site-packages/filelock/_api.py", line 183, in acquire
self._lock.acquire()
File "/home/twni2016/env/lib/python3.10/site-packages/filelock/_api.py", line 183, in acquire
raise Timeout(self._lock_file)
raise Timeout(self._lock_file)
filelock._error.Timeout: The file lock '*/.aim/.repo_lock.softlock' could not be acquired.
filelock._error.Timeout: The file lock '*/.aim/.repo_lock.softlock' could not be acquired.
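For concreteness, here is a minimal reproduction sketch; the script name, repo path, and tracked metric are illustrative, not from the original report:

# repro.py -- each Slurm task executes this script concurrently.
# Submit with: sbatch --ntasks 2 --wrap "srun python repro.py"
from aim import Run

# Every task opens the same .aim repository, so all of them race to take
# .aim/.repo_lock.softlock during Repo initialization; the loser can hit
# filelock's acquire timeout and raise the error above.
run = Run(repo='.')
run.track(1.0, name='loss', step=0)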
Expected behavior
No such error.
Environment
- Aim version: 3.14.1
- Python version: 3.10
- pip version: 22.2.2
- OS: Linux
Top GitHub Comments
@twni2016 ah, my bad, for a moment I thought ntasks executes parallel threads which try to write to the same run.
Most probably this is purely related to the issue with the runs locking mechanism; the fix is in progress now. A similar issue was reported last week as well. The plan is to release a patch fix in the coming days. I will let you know once the fix is shipped.
@twni2016 Aim has a natural limitation which restricts writing to the same run from multiple parallel clients. It seems the ntasks argument runs parallel threads, which is causing the issue. Is there a way in Slurm to configure the workflow to initialize the aim.Run only once and then use it as a shared resource between the threads?
@alberttorosyan tagging you so you are aware of this thread.
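A minimal sketch of that suggestion, assuming Slurm's standard SLURM_PROCID environment variable (the task's rank within the job step); the rank-0 guard is an illustrative workaround, not an official Aim API:

import os

from aim import Run

# SLURM_PROCID is set by Slurm to this task's rank within the job step.
rank = int(os.environ.get('SLURM_PROCID', '0'))

run = None
if rank == 0:
    # Only task 0 touches the repo, so .repo_lock.softlock is never contended.
    run = Run(repo='.')

if run is not None:
    run.track(0.5, name='loss', step=0)
# Tasks with rank > 0 would need to hand their metrics to task 0
# (e.g. via MPI or shared files) if they also have data to log.

Alternatively, staggering task start-up (e.g. sleeping a rank-proportional number of seconds before calling Run()) reduces lock contention, but it only makes the race less likely rather than removing it.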