Temporary saved file already exists
Hi,
Thank you for this amazing tool! I just started using it recently. I’m encountering a weird error and I was hoping you could help me fix it. Here is the error log:
submitit WARNING (2021-03-28 01:13:17,420) - Caught signal 15 on learnfair0463: this job is preempted.
slurmstepd: error: *** STEP 38544509.0 ON learnfair0463 CANCELLED AT 2021-03-28T01:13:17 DUE TO JOB REQUEUE ***
slurmstepd: error: *** JOB 38544509 ON learnfair0463 CANCELLED AT 2021-03-28T01:13:17 DUE TO JOB REQUEUE ***
submitit WARNING (2021-03-28 01:13:17,482) - Bypassing signal 18
submitit WARNING (2021-03-28 01:13:17,483) - Caught signal 15 on learnfair0463: this job is preempted.
38544484_16: Job is pending execution
submitit ERROR (2021-03-28 01:13:17,535) - Could not dump error:
Command '['scontrol', 'requeue', '38544484_16']' returned non-zero exit status 1.
because of A temporary saved file already exists.
submitit ERROR (2021-03-28 01:13:17,535) - Submitted job triggered an exception
Traceback (most recent call last):
File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/_submit.py", line 11, in <module>
submitit_main()
File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/submission.py", line 71, in submitit_main
process_job(args.folder)
File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/submission.py", line 64, in process_job
raise error
File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/submission.py", line 55, in process_job
utils.cloudpickle_dump(("success", result), tmppath)
File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/utils.py", line 238, in cloudpickle_dump
cloudpickle.dump(obj, ofile, pickle.HIGHEST_PROTOCOL)
File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/job_environment.py", line 209, in checkpoint_and_try_requeue
self.env._requeue(countdown)
File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/slurm/slurm.py", line 193, in _requeue
subprocess.check_call(["scontrol", "requeue", jid])
File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/subprocess.py", line 364, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['scontrol', 'requeue', '38544484_16']' returned non-zero exit status 1.
/bin/bash: /public/apps/anaconda3/2020.11/lib/libtinfo.so.6: no version information available (required by /bin/bash)
submitit ERROR (2021-03-28 01:35:36,155) - Could not dump error:
A temporary saved file already exists.
because of A temporary saved file already exists.
submitit ERROR (2021-03-28 01:35:36,156) - Submitted job triggered an exception
Traceback (most recent call last):
File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/_submit.py", line 11, in <module>
submitit_main()
File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/submission.py", line 71, in submitit_main
process_job(args.folder)
File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/submission.py", line 64, in process_job
raise error
File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/submission.py", line 54, in process_job
with utils.temporary_save_path(paths.result_pickle) as tmppath: # save somewhere else, and move
File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/contextlib.py", line 113, in __enter__
return next(self.gen)
File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/utils.py", line 171, in temporary_save_path
assert not tmppath.exists(), "A temporary saved file already exists."
AssertionError: A temporary saved file already exists.
srun: error: learnfair0292: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=38544509.1
My analysis of the error is as follows. The temporary save file error is thrown in process_job. One possible reason this could happen is that tmppath was created previously in the try block, but there was a failure before the context ended. This could happen either in the utils.cloudpickle_dump() call or in logger.info(). However, I can see a temporary save path, 38544484_16_0_result.pkl.save_tmp, that contains the following information: ('success', None). So is the error with logger? Or am I completely off here?
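To make the suspected failure mode concrete, here is a minimal sketch of a "save to a temporary path, then move" context manager. It is not submitit's actual source: the assertion message and the .save_tmp suffix are taken from the traceback and file name above, while temporary_save_path_sketch, simulate_interrupted_then_retried_dump, and the rename-on-success step are assumptions made for illustration. One reading of the two tracebacks is that the first run was interrupted inside the with block (the failed scontrol requeue was raised while cloudpickle.dump was running), so the temporary file was never moved, and the requeued run then failed the assertion on entry.

```python
import contextlib
import pickle
import tempfile
from pathlib import Path


@contextlib.contextmanager
def temporary_save_path_sketch(filepath):
    """Hypothetical stand-in for a temp-then-move save helper (not submitit's code)."""
    filepath = Path(filepath)
    # Assumed naming scheme, based on the leftover "*.pkl.save_tmp" file in the report.
    tmppath = filepath.with_name(filepath.name + ".save_tmp")
    assert not tmppath.exists(), "A temporary saved file already exists."
    yield tmppath
    # Reached only if the with-body finished without raising.
    tmppath.rename(filepath)


def simulate_interrupted_then_retried_dump(folder):
    """First attempt fails after writing; second attempt hits the assertion."""
    result_path = Path(folder) / "result.pkl"
    tmp_file = result_path.with_name(result_path.name + ".save_tmp")

    # First attempt: the dump succeeds, then something raises before the context exits
    # (standing in for an exception raised by a signal handler).
    try:
        with temporary_save_path_sketch(result_path) as tmppath:
            with tmppath.open("wb") as ofile:
                pickle.dump(("success", None), ofile)
            raise RuntimeError("simulated failure before the context ended")
    except RuntimeError:
        pass
    leftover = tmp_file.exists()  # the temporary file was never moved

    # Second attempt (e.g. after a requeue): entering the context fails immediately.
    try:
        with temporary_save_path_sketch(result_path) as tmppath:
            with tmppath.open("wb") as ofile:
                pickle.dump(("success", None), ofile)
    except AssertionError as error:
        return leftover, str(error)
    return leftover, None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        print(simulate_interrupted_then_retried_dump(folder))
```

In this sketch the leftover file holds a complete ('success', None) pickle, which would be consistent with the dump itself having finished and the failure coming from something else raised inside the block.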
I’m running a job array with 1024 jobs and slurm_array_parallelism set to 128. The code run by the jobs actually completed and the results were saved, so I don’t think this is an error in the Python function I ran.
Top GitHub Comments
Update to 1.3.3, which should be better and work for FAIR cluster now.
On my side, absolutely none. From what you had said, I tried to submit a job within a temporary directory context, but that did not change anything 😒
I think this may be related to the signal handling changes made in v1.3.0. I’m using v1.3.1 now and this problem is not there anymore (using the same jobs as before). I’ll close it then. Thanks!