Exiting the batchSystem error
I decided to use `--workDir WORKDIR` and `--cleanWorkDir never` for debugging purposes when running Cactus on an LSF-based system. WORKDIR is on a filesystem shared between the LSF nodes.
I never saw this sort of error when the options above were left at their default values. I have no idea why `FileNotFoundError` was raised when the job was exiting the batch system after I changed the option values.
The job has been successfully executed. The error is only at the exit step of the batch system.
Any idea?
Cheers!
Job failed with exit value 1: 'RoundedJob' kind-RoundedJob/instance-w5vahj8b
No log file is present, despite job failing: 'RoundedJob' kind-RoundedJob/instance-w5vahj8b
The batch system left a non-empty file /hps/nobackup/production/ensembl/thiagogenez/pairwises/bemisia/run/workdir/toil_workflow_0499d6b7-bdc1-48a7-bf10-d9f94489af2a_job_110_batch_lsf_3036237_std_output.log:
Log from job "kind-RoundedJob/instance-w5vahj8b" follows:
=========>
------------------------------------------------------------
Sender: LSF System <lsf@hx-noah-04-16>
Subject: Job 3036237: <toil_job_110> in cluster <EBI> Exited
Job <toil_job_110> was submitted from host <noah-login-01> by user <thiagogenez> in cluster <EBI> at Fri Mar 19 00:49:18 2021
Job was executed on host(s) <hx-noah-04-16>, in queue <research-rh74>, as user <thiagogenez> in cluster <EBI> at Fri Mar 19 00:49:18 2021
</homes/thiagogenez> was used as the home directory.
</hps/nobackup/production/ensembl/thiagogenez/pairwises/bemisia/run/.> was used as the working directory.
Started at Fri Mar 19 00:49:18 2021
Terminated at Fri Mar 19 00:49:22 2021
Results reported at Fri Mar 19 00:49:22 2021
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
_toil_worker RoundedJob file:/hps/nobackup/production/ensembl/thiagogenez/pairwises/bemisia/run/jobstore/2 kind-RoundedJob/instance-w5vahj8b --context gASV/wAAAAAAAACMJXRvaWwuYmF0Y2hTeXN0ZW1zLmFic3RyYWN0QmF0Y2hTeXN0ZW2UjBRXb3JrZXJDbGVhbnVwQ29udGV4dJSTlCmBlH2UKIwRd29ya2VyQ2xlYW51cEluZm+UaACMEVdvcmtlckNsZWFudXBJbmZvlJOUjEovaHBzL25vYmFja3VwL3Byb2R1Y3Rpb24vZW5zZW1ibC90aGlhZ29nZW5lei9wYWlyd2lzZXMvYmVtaXNpYS9ydW4vd29ya2RpcpSMJDA0OTlkNmI3LWJkYzEtNDhhNy1iZjEwLWQ5Zjk0NDg5YWYyYZSMBW5ldmVylIeUgZSMBWFyZW5hlE51Yi4=
------------------------------------------------------------
Exited with exit code 1.
Resource usage summary:
CPU time : 1.72 sec.
Max Memory : 81 MB
Average Memory : 75.00 MB
Total Requested Memory : 2048.00 MB
Delta Memory : 1967.00 MB
Max Swap : 229 MB
Max Processes : 3
Max Threads : 4
Run time : 4 sec.
Turnaround time : 4 sec.
The output (if any) is above this job summary.
PS:
Read file </hps/nobackup/production/ensembl/thiagogenez/pairwises/bemisia/run/workdir/toil_workflow_0499d6b7-bdc1-48a7-bf10-d9f94489af2a_job_110_batch_lsf_3036237_std_error.log> for stderr output of this job.
<=========
The batch system left a non-empty file /hps/nobackup/production/ensembl/thiagogenez/pairwises/bemisia/run/workdir/toil_workflow_0499d6b7-bdc1-48a7-bf10-d9f94489af2a_job_110_batch_lsf_3036237_std_error.log:
Log from job "kind-RoundedJob/instance-w5vahj8b" follows:
=========>
[2021-03-19T00:49:21+0000] [MainThread] [D] [toil.statsAndLogging] Suppressing the following loggers: {'boto', 'google', 'rdflib', 'boto3', 'oauthlib', 'requests', 'bcdocs', 'botocore', 'cachecontrol', 'concurrent', 'requests_oauthlib', 'asyncio', 'galaxy', 'docker', 'kubernetes', 'urllib3', 'prov', 'humanfriendly', 'dill', 'websocket', 'salad'}
[2021-03-19T00:49:21+0000] [MainThread] [D] [toil.common] Obtained node ID 6b645fb5-4643-487c-a02d-86176f5db4f0 from file /proc/sys/kernel/random/boot_id
[2021-03-19T00:49:21+0000] [MainThread] [I] [toil.worker] Redirecting logging to /hps/nobackup/production/ensembl/thiagogenez/pairwises/bemisia/run/workdir/node-0499d6b7-bdc1-48a7-bf10-d9f94489af2a-6b645fb5-4643-487c-a02d-86176f5db4f0/tmpfnxn0zl9/worker_log.txt
[2021-03-19T00:49:21+0000] [MainThread] [D] [toil.deferred] Deleting /hps/nobackup/production/ensembl/thiagogenez/pairwises/bemisia/run/workdir/node-0499d6b7-bdc1-48a7-bf10-d9f94489af2a-6b645fb5-4643-487c-a02d-86176f5db4f0/deferred/funcyn_v_01q
[2021-03-19T00:49:21+0000] [MainThread] [D] [toil.lib.threading] Leaving arena /hps/nobackup/production/ensembl/thiagogenez/pairwises/bemisia/run/workdir/0499d6b7-bdc1-48a7-bf10-d9f94489af2a-cleanup-arena-members
[2021-03-19T00:49:21+0000] [MainThread] [D] [toil.lib.threading] PID 39808 acquiring mutex /hps/nobackup/production/ensembl/thiagogenez/pairwises/bemisia/run/workdir/toil-mutex-0499d6b7-bdc1-48a7-bf10-d9f94489af2a-cleanup-arena-lock
[2021-03-19T00:49:21+0000] [MainThread] [D] [toil.lib.threading] PID 39808 now holds mutex /hps/nobackup/production/ensembl/thiagogenez/pairwises/bemisia/run/workdir/toil-mutex-0499d6b7-bdc1-48a7-bf10-d9f94489af2a-cleanup-arena-lock
[2021-03-19T00:49:21+0000] [MainThread] [D] [toil.lib.threading] PID 39808 releasing mutex /hps/nobackup/production/ensembl/thiagogenez/pairwises/bemisia/run/workdir/toil-mutex-0499d6b7-bdc1-48a7-bf10-d9f94489af2a-cleanup-arena-lock
Traceback (most recent call last):
File "/nfs/production/panda/ensembl/thiagogenez/usr/linuxbrew/bin/_toil_worker", line 8, in <module>
sys.exit(main())
File "/nfs/production/panda/ensembl/thiagogenez/usr/linuxbrew/Cellar/cactus/1.3.0/libexec/lib/python3.8/site-packages/toil/worker.py", line 685, in main
exit_code = workerScript(jobStore, config, options.jobName, options.jobStoreID)
File "/nfs/production/panda/ensembl/thiagogenez/usr/linuxbrew/opt/python@3.8/lib/python3.8/contextlib.py", line 120, in __exit__
next(self.gen)
File "/nfs/production/panda/ensembl/thiagogenez/usr/linuxbrew/Cellar/cactus/1.3.0/libexec/lib/python3.8/site-packages/toil/worker.py", line 662, in in_contexts
yield
File "/nfs/production/panda/ensembl/thiagogenez/usr/linuxbrew/Cellar/cactus/1.3.0/libexec/lib/python3.8/site-packages/toil/batchSystems/abstractBatchSystem.py", line 509, in __exit__
for _ in self.arena.leave():
File "/nfs/production/panda/ensembl/thiagogenez/usr/linuxbrew/Cellar/cactus/1.3.0/libexec/lib/python3.8/site-packages/toil/lib/threading.py", line 490, in leave
fd = os.open(full_path, os.O_RDONLY)
FileNotFoundError: [Errno 2] No such file or directory: '/hps/nobackup/production/ensembl/thiagogenez/pairwises/bemisia/run/workdir/0499d6b7-bdc1-48a7-bf10-d9f94489af2a-cleanup-arena-members/tmpxg_9vehp'
<=========
┆Issue is synchronized with this Jira Story ┆friendlyId: TOIL-824
Issue Analytics
- State:
- Created 3 years ago
- Comments:5 (4 by maintainers)
Top GitHub Comments
Not quite; it should tolerate a shared `--workDir`, because it consults `/proc` to find out what machine it is on and creates an arena named with (I think) the UUID of the current system boot. But the shared `--workDir` would have to have the same consistency guarantees and `fcntl` locking support as a “normal” (e.g. ext4) filesystem, so GPFS or some speed-optimized NFS configurations will not work.

I suppose it’s also possible that the logic to name the arenas differently for the different machines isn’t working; that could also cause the results you are seeing.
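
For anyone who wants to check these two conditions on their own cluster, here is a rough sketch, not Toil's actual code. It reads the boot ID from the same `/proc/sys/kernel/random/boot_id` file that appears in the log above, and it tries to take an `fcntl` lock on a scratch file in a given directory. The function names `read_boot_id` and `fcntl_locking_works` are made up for this example. A probe that passes from a single process does not prove that locks and directory listings are consistent across nodes; it only catches filesystems where locking fails outright.

```python
import fcntl
import os
import tempfile


def read_boot_id(path="/proc/sys/kernel/random/boot_id"):
    """Return the kernel boot UUID as a string, or None if it cannot be read."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        # Not Linux, or /proc is not mounted.
        return None


def fcntl_locking_works(directory):
    """Try to take and release an exclusive fcntl lock on a scratch file.

    Hypothetical helper: returns False if the filesystem rejects the lock.
    """
    fd, path = tempfile.mkstemp(dir=directory, prefix="lockprobe-")
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True
    except OSError:
        return False
    finally:
        os.close(fd)
        os.unlink(path)
```

Running `read_boot_id()` on two different LSF nodes should give two different values; if it does not, the per-machine arena naming described above cannot tell the nodes apart.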
Maybe what we should do is ignore the Toil work directory for these arena files, or at least ignore it if it looks like a shared file system, and try a few temp directories (or `/var/run` directories) looking for a local place to put them.

➤ Melaina Legaspi commented:
Closing in favor of toil-1071
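
The idea floated above (skip the work directory when it looks shared, and try a few local candidates instead) could look roughly like the sketch below. This is only an illustration under stated assumptions, not what Toil does: the helper names, the list of "shared" filesystem types, and the candidate directories are all made up for the example. The mount table is passed in as text (on Linux it could be read from `/proc/mounts`), and mount points with escaped characters such as `\040` are not handled.

```python
import os

# Assumed, non-exhaustive set of network/cluster filesystem types.
SHARED_FS_TYPES = {"nfs", "nfs4", "gpfs", "lustre", "cifs", "beegfs"}


def filesystem_type(path, mounts_text):
    """Return the fs type of the longest mount point that contains path."""
    path = os.path.abspath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        # Compare with trailing slashes so /data/shared does not match /data/sharedfoo.
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


def pick_local_dir(candidates, mounts_text):
    """Return the first writable candidate that is not on a shared filesystem."""
    for directory in candidates:
        if not (os.path.isdir(directory) and os.access(directory, os.W_OK)):
            continue
        if filesystem_type(directory, mounts_text) not in SHARED_FS_TYPES:
            return directory
    return None
```

A caller might pass something like `["/var/run", "/tmp", "/var/tmp"]` as the candidates and fall back to the work directory only when `pick_local_dir` returns `None`.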