Exiting the batchSystem error

I am running Cactus on an LSF-based system with --workDir WORKDIR and --cleanWorkDir never for debugging purposes. WORKDIR is on a filesystem shared between the LSF nodes.

I never saw this error when those two options were left at their default values. After changing them, FileNotFoundError is raised while the job is exiting the batch system, and I have no idea why.

The job itself ran successfully. The error appears only at the exit step of the batch system.

Any idea?

Cheers!

Job failed with exit value 1: 'RoundedJob' kind-RoundedJob/instance-w5vahj8b
No log file is present, despite job failing: 'RoundedJob' kind-RoundedJob/instance-w5vahj8b
The batch system left a non-empty file /hps/nobackup/production/ensembl/thiagogenez/pairwises/bemisia/run/workdir/toil_workflow_0499d6b7-bdc1-48a7-bf10-d9f94489af2a_job_110_batch_lsf_3036237_std_output.log:
Log from job "kind-RoundedJob/instance-w5vahj8b" follows:
=========>

	------------------------------------------------------------
	Sender: LSF System <lsf@hx-noah-04-16>
	Subject: Job 3036237: <toil_job_110> in cluster <EBI> Exited

	Job <toil_job_110> was submitted from host <noah-login-01> by user <thiagogenez> in cluster <EBI> at Fri Mar 19 00:49:18 2021
	Job was executed on host(s) <hx-noah-04-16>, in queue <research-rh74>, as user <thiagogenez> in cluster <EBI> at Fri Mar 19 00:49:18 2021
	</homes/thiagogenez> was used as the home directory.
	</hps/nobackup/production/ensembl/thiagogenez/pairwises/bemisia/run/.> was used as the working directory.
	Started at Fri Mar 19 00:49:18 2021
	Terminated at Fri Mar 19 00:49:22 2021
	Results reported at Fri Mar 19 00:49:22 2021

	Your job looked like:

	------------------------------------------------------------
	# LSBATCH: User input
	_toil_worker RoundedJob file:/hps/nobackup/production/ensembl/thiagogenez/pairwises/bemisia/run/jobstore/2 kind-RoundedJob/instance-w5vahj8b --context gASV/wAAAAAAAACMJXRvaWwuYmF0Y2hTeXN0ZW1zLmFic3RyYWN0QmF0Y2hTeXN0ZW2UjBRXb3JrZXJDbGVhbnVwQ29udGV4dJSTlCmBlH2UKIwRd29ya2VyQ2xlYW51cEluZm+UaACMEVdvcmtlckNsZWFudXBJbmZvlJOUjEovaHBzL25vYmFja3VwL3Byb2R1Y3Rpb24vZW5zZW1ibC90aGlhZ29nZW5lei9wYWlyd2lzZXMvYmVtaXNpYS9ydW4vd29ya2RpcpSMJDA0OTlkNmI3LWJkYzEtNDhhNy1iZjEwLWQ5Zjk0NDg5YWYyYZSMBW5ldmVylIeUgZSMBWFyZW5hlE51Yi4=
	------------------------------------------------------------

	Exited with exit code 1.

	Resource usage summary:

	    CPU time :                                   1.72 sec.
	    Max Memory :                                 81 MB
	    Average Memory :                             75.00 MB
	    Total Requested Memory :                     2048.00 MB
	    Delta Memory :                               1967.00 MB
	    Max Swap :                                   229 MB
	    Max Processes :                              3
	    Max Threads :                                4
	    Run time :                                   4 sec.
	    Turnaround time :                            4 sec.

	The output (if any) is above this job summary.



	PS:

	Read file </hps/nobackup/production/ensembl/thiagogenez/pairwises/bemisia/run/workdir/toil_workflow_0499d6b7-bdc1-48a7-bf10-d9f94489af2a_job_110_batch_lsf_3036237_std_error.log> for stderr output of this job.

<=========
The batch system left a non-empty file /hps/nobackup/production/ensembl/thiagogenez/pairwises/bemisia/run/workdir/toil_workflow_0499d6b7-bdc1-48a7-bf10-d9f94489af2a_job_110_batch_lsf_3036237_std_error.log:
Log from job "kind-RoundedJob/instance-w5vahj8b" follows:
=========>
	[2021-03-19T00:49:21+0000] [MainThread] [D] [toil.statsAndLogging] Suppressing the following loggers: {'boto', 'google', 'rdflib', 'boto3', 'oauthlib', 'requests', 'bcdocs', 'botocore', 'cachecontrol', 'concurrent', 'requests_oauthlib', 'asyncio', 'galaxy', 'docker', 'kubernetes', 'urllib3', 'prov', 'humanfriendly', 'dill', 'websocket', 'salad'}
	[2021-03-19T00:49:21+0000] [MainThread] [D] [toil.common] Obtained node ID 6b645fb5-4643-487c-a02d-86176f5db4f0 from file /proc/sys/kernel/random/boot_id
	[2021-03-19T00:49:21+0000] [MainThread] [I] [toil.worker] Redirecting logging to /hps/nobackup/production/ensembl/thiagogenez/pairwises/bemisia/run/workdir/node-0499d6b7-bdc1-48a7-bf10-d9f94489af2a-6b645fb5-4643-487c-a02d-86176f5db4f0/tmpfnxn0zl9/worker_log.txt
	[2021-03-19T00:49:21+0000] [MainThread] [D] [toil.deferred] Deleting /hps/nobackup/production/ensembl/thiagogenez/pairwises/bemisia/run/workdir/node-0499d6b7-bdc1-48a7-bf10-d9f94489af2a-6b645fb5-4643-487c-a02d-86176f5db4f0/deferred/funcyn_v_01q
	[2021-03-19T00:49:21+0000] [MainThread] [D] [toil.lib.threading] Leaving arena /hps/nobackup/production/ensembl/thiagogenez/pairwises/bemisia/run/workdir/0499d6b7-bdc1-48a7-bf10-d9f94489af2a-cleanup-arena-members
	[2021-03-19T00:49:21+0000] [MainThread] [D] [toil.lib.threading] PID 39808 acquiring mutex /hps/nobackup/production/ensembl/thiagogenez/pairwises/bemisia/run/workdir/toil-mutex-0499d6b7-bdc1-48a7-bf10-d9f94489af2a-cleanup-arena-lock
	[2021-03-19T00:49:21+0000] [MainThread] [D] [toil.lib.threading] PID 39808 now holds mutex /hps/nobackup/production/ensembl/thiagogenez/pairwises/bemisia/run/workdir/toil-mutex-0499d6b7-bdc1-48a7-bf10-d9f94489af2a-cleanup-arena-lock
	[2021-03-19T00:49:21+0000] [MainThread] [D] [toil.lib.threading] PID 39808 releasing mutex /hps/nobackup/production/ensembl/thiagogenez/pairwises/bemisia/run/workdir/toil-mutex-0499d6b7-bdc1-48a7-bf10-d9f94489af2a-cleanup-arena-lock
	Traceback (most recent call last):
	  File "/nfs/production/panda/ensembl/thiagogenez/usr/linuxbrew/bin/_toil_worker", line 8, in <module>
	    sys.exit(main())
	  File "/nfs/production/panda/ensembl/thiagogenez/usr/linuxbrew/Cellar/cactus/1.3.0/libexec/lib/python3.8/site-packages/toil/worker.py", line 685, in main
	    exit_code = workerScript(jobStore, config, options.jobName, options.jobStoreID)
	  File "/nfs/production/panda/ensembl/thiagogenez/usr/linuxbrew/opt/python@3.8/lib/python3.8/contextlib.py", line 120, in __exit__
	    next(self.gen)
	  File "/nfs/production/panda/ensembl/thiagogenez/usr/linuxbrew/Cellar/cactus/1.3.0/libexec/lib/python3.8/site-packages/toil/worker.py", line 662, in in_contexts
	    yield
	  File "/nfs/production/panda/ensembl/thiagogenez/usr/linuxbrew/Cellar/cactus/1.3.0/libexec/lib/python3.8/site-packages/toil/batchSystems/abstractBatchSystem.py", line 509, in __exit__
	    for _ in self.arena.leave():
	  File "/nfs/production/panda/ensembl/thiagogenez/usr/linuxbrew/Cellar/cactus/1.3.0/libexec/lib/python3.8/site-packages/toil/lib/threading.py", line 490, in leave
	    fd = os.open(full_path, os.O_RDONLY)
	FileNotFoundError: [Errno 2] No such file or directory: '/hps/nobackup/production/ensembl/thiagogenez/pairwises/bemisia/run/workdir/0499d6b7-bdc1-48a7-bf10-d9f94489af2a-cleanup-arena-members/tmpxg_9vehp'
<=========

┆Issue is synchronized with this Jira Story ┆friendlyId: TOIL-824

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

adamnovak commented, Mar 19, 2021

Not quite; it should tolerate a shared --workDir, because it consults /proc to find out what machine it is on and creates an arena named with (I think) the UUID of the current system boot. But the shared --workDir would have to have the same consistency guarantees and fcntl locking support as a “normal” (e.g. ext4) filesystem, so GPFS or some speed-optimized NFS configurations will not work.
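
One quick way to check the locking part of this is to probe the --workDir directly. The snippet below is only a rough sketch, not anything from Toil: `fcntl_lock_works` is a made-up helper name, and it only shows whether a single process can take and release an fcntl lock on a scratch file in that directory. It cannot prove that locks are honored consistently across nodes, which is the stronger guarantee described above.

```python
import fcntl
import os
import tempfile


def fcntl_lock_works(directory: str) -> bool:
    """Hypothetical probe: try an exclusive, non-blocking fcntl lock on a
    scratch file in `directory` and report whether it succeeded."""
    # mkstemp opens the file read/write, which an exclusive lock requires.
    fd, path = tempfile.mkstemp(dir=directory, prefix="lockprobe-")
    try:
        try:
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            fcntl.lockf(fd, fcntl.LOCK_UN)
            return True
        except OSError:
            # Filesystems without working fcntl locks typically fail here.
            return False
    finally:
        os.close(fd)
        os.unlink(path)
```

Running `fcntl_lock_works("/path/to/workdir")` on two different LSF nodes at least rules out the case where locking is rejected outright. A `True` result on both nodes still does not guarantee cross-node consistency.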

I suppose it’s also possible that the logic to name the arenas differently for the different machines isn’t working; that could also cause the results you are seeing.
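
For what it is worth, the log above seems to show both patterns: the worker log directory is named `node-<workflow ID>-<boot ID>`, while the path in the traceback (`...-cleanup-arena-members`) contains only the workflow ID. The sketch below just makes that comparison explicit. The helper names are hypothetical and this is not Toil's code; the only thing taken from the log is the boot ID file `/proc/sys/kernel/random/boot_id`.

```python
import os

# File the worker log reports reading the node ID from.
BOOT_ID_PATH = "/proc/sys/kernel/random/boot_id"


def read_boot_id(path: str = BOOT_ID_PATH) -> str:
    """Return the current boot ID (Linux only)."""
    with open(path) as handle:
        return handle.read().strip()


def per_node_dir_name(workflow_id: str, boot_id: str) -> str:
    """Hypothetical helper mirroring the node-<workflow>-<boot> pattern
    visible in the worker log directory name."""
    return f"node-{workflow_id}-{boot_id}"


def path_is_node_specific(path: str, boot_id: str) -> bool:
    """True if `path` embeds the given boot ID anywhere in its components."""
    return boot_id in path


if __name__ == "__main__" and os.path.exists(BOOT_ID_PATH):
    print("boot ID on this node:", read_boot_id())
```

If `path_is_node_specific` returns `False` for the arena path on a shared --workDir, workers on different machines would be sharing one arena directory, which would fit the second explanation.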

Maybe what we should do is ignore the Toil work directory for these arena files, or at least ignore it if it looks like a shared file system, and try a few temp directories (or /var/run directories) looking for a local place to put them.
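
A rough sketch of what that search could look like, purely as an illustration of the idea and not an actual Toil change. Everything here is an assumption: the helper names, the list of filesystem types treated as shared, and the use of `/proc/mounts` to guess the filesystem type (mount points containing spaces are not handled).

```python
import os
from typing import Iterable, Optional

# Assumption: filesystem types treated as "shared". Not an exhaustive list.
SHARED_FS_TYPES = {"nfs", "nfs4", "gpfs", "lustre", "cifs", "beegfs"}


def filesystem_type(path: str, mounts_file: str = "/proc/mounts") -> Optional[str]:
    """Return the type of the deepest mount containing `path`, or None."""
    real = os.path.realpath(path)
    best_len, best_type = -1, None
    try:
        with open(mounts_file) as handle:
            for line in handle:
                fields = line.split()
                if len(fields) < 3:
                    continue
                mount_point, fs_type = fields[1], fields[2]
                prefix = mount_point.rstrip("/") + "/"
                inside = real == mount_point or real.startswith(prefix)
                # ">=" lets a later mount on the same point override an earlier one.
                if inside and len(mount_point) >= best_len:
                    best_len, best_type = len(mount_point), fs_type
    except OSError:
        return None
    return best_type


def pick_local_dir(candidates: Iterable[str],
                   mounts_file: str = "/proc/mounts") -> Optional[str]:
    """Return the first writable candidate that is not on a shared filesystem."""
    for candidate in candidates:
        if not os.path.isdir(candidate) or not os.access(candidate, os.W_OK):
            continue
        if filesystem_type(candidate, mounts_file) in SHARED_FS_TYPES:
            continue
        return candidate
    return None
```

A caller might try something like `pick_local_dir(["/var/run", tempfile.gettempdir()])` and fall back to the work directory only when nothing local is found.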

unito-bot commented, Feb 1, 2022

➤ Melaina Legaspi commented:

Closing in favor of toil-1071
