Temporary saved file already exists

Hi,

Thank you for this amazing tool! I just started using it recently. I’m encountering some weird error and I was hoping you could help me fix it. Here is the error log:

submitit WARNING (2021-03-28 01:13:17,420) - Caught signal 15 on learnfair0463: this job is preempted.
slurmstepd: error: *** STEP 38544509.0 ON learnfair0463 CANCELLED AT 2021-03-28T01:13:17 DUE TO JOB REQUEUE ***
slurmstepd: error: *** JOB 38544509 ON learnfair0463 CANCELLED AT 2021-03-28T01:13:17 DUE TO JOB REQUEUE ***
submitit WARNING (2021-03-28 01:13:17,482) - Bypassing signal 18
submitit WARNING (2021-03-28 01:13:17,483) - Caught signal 15 on learnfair0463: this job is preempted.
38544484_16: Job is pending execution
submitit ERROR (2021-03-28 01:13:17,535) - Could not dump error:
Command '['scontrol', 'requeue', '38544484_16']' returned non-zero exit status 1.

because of A temporary saved file already exists.
submitit ERROR (2021-03-28 01:13:17,535) - Submitted job triggered an exception
Traceback (most recent call last):
  File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/_submit.py", line 11, in <module>
    submitit_main()
  File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/submission.py", line 71, in submitit_main
    process_job(args.folder)
  File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/submission.py", line 64, in process_job
    raise error
  File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/submission.py", line 55, in process_job
    utils.cloudpickle_dump(("success", result), tmppath)
  File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/utils.py", line 238, in cloudpickle_dump
    cloudpickle.dump(obj, ofile, pickle.HIGHEST_PROTOCOL)
  File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/job_environment.py", line 209, in checkpoint_and_try_requeue
    self.env._requeue(countdown)
  File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/slurm/slurm.py", line 193, in _requeue
    subprocess.check_call(["scontrol", "requeue", jid])
  File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['scontrol', 'requeue', '38544484_16']' returned non-zero exit status 1.
/bin/bash: /public/apps/anaconda3/2020.11/lib/libtinfo.so.6: no version information available (required by /bin/bash)
submitit ERROR (2021-03-28 01:35:36,155) - Could not dump error:
A temporary saved file already exists.

because of A temporary saved file already exists.
submitit ERROR (2021-03-28 01:35:36,156) - Submitted job triggered an exception
Traceback (most recent call last):
  File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/_submit.py", line 11, in <module>
    submitit_main()
  File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/submission.py", line 71, in submitit_main
    process_job(args.folder)
  File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/submission.py", line 64, in process_job
    raise error
  File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/submission.py", line 54, in process_job
    with utils.temporary_save_path(paths.result_pickle) as tmppath:  # save somewhere else, and move
  File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/private/home/sramakri/.conda/envs/ego4d/lib/python3.8/site-packages/submitit/core/utils.py", line 171, in temporary_save_path
    assert not tmppath.exists(), "A temporary saved file already exists."
AssertionError: A temporary saved file already exists.
srun: error: learnfair0292: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=38544509.1

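My analysis of the error is as follows. The "temporary saved file already exists" assertion is raised in process_job, when it enters utils.temporary_save_path (submission.py, line 54 in the second traceback). One possible reason is that the tmppath was created on a previous pass through the try block, but there was a failure before the context exited, so the temporary file was never moved to its final location.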

This could happen either in the utils.cloudpickle_dump() call or in logger.info(). However, I can see a temporary save path 38544484_16_0_result.pkl.save_tmp that contains the following information ('success', None). So is the error with logger? Or am I completely off here?
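To make the hypothesis above concrete, here is a minimal sketch of how a leftover temporary file could trigger this assertion. It is not submitit's actual implementation: only the assertion message, the `temporary_save_path` name, and the `.save_tmp` suffix are taken from the traceback and file name above. The body of the context manager, the `demo` helper, and the simulated interruption are assumptions made for illustration.

```python
import contextlib
import pickle
import tempfile
from pathlib import Path


@contextlib.contextmanager
def temporary_save_path(filepath):
    """Simplified stand-in (assumption): yield a temp path, then move it into place."""
    filepath = Path(filepath)
    tmppath = filepath.parent / (filepath.name + ".save_tmp")
    assert not tmppath.exists(), "A temporary saved file already exists."
    yield tmppath
    # Only reached if the with-body finished without raising.
    tmppath.rename(filepath)


def demo():
    """Interrupt a first save, then show that a second attempt hits the assertion."""
    with tempfile.TemporaryDirectory() as folder:
        result = Path(folder) / "result.pkl"
        try:
            with temporary_save_path(result) as tmppath:
                with tmppath.open("wb") as f:
                    pickle.dump(("success", None), f)
                # Hypothetical failure after the dump but before the context exits.
                raise RuntimeError("simulated interruption")
        except RuntimeError:
            pass
        leftover = (Path(folder) / "result.pkl.save_tmp").exists()
        try:
            with temporary_save_path(result):
                pass
            second_attempt = "ok"
        except AssertionError as e:
            second_attempt = str(e)
        return leftover, second_attempt


if __name__ == "__main__":
    print(demo())
```

In this sketch the first attempt leaves `result.pkl.save_tmp` behind with `('success', None)` already written, which matches the file observed above. Any later attempt to save to the same path then fails on entry to the context manager, before anything is dumped.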

I’m running a job array of 1024 jobs with slurm_array_parallelism set to 128. The code run by the jobs actually completed and the results were saved, so I don’t think this is an error in the Python function I ran.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

1 reaction
jrapin commented, Apr 8, 2021

Update to 1.3.3, which should be better and work for FAIR cluster now.

Any ideas on the cause that I could try to emulate?

On my side, absolutely none. From what you said, I tried submitting a job within a temporary directory context, but that did not change anything 😒

0 reactions
srama2512 commented, Apr 8, 2021

I think this may be related to the signal handling changes made in v1.3.0. I’m using v1.3.1 now and this problem is not there anymore (using the same jobs as before). I’ll close it then. Thanks!
