cwltoil managing LSF will occasionally print dead jobs logging info repetitively in DEBUG mode
I wrote about this a bit on Gitter. This is non-deterministic. I recall another recent issue where jobs landing on the same node would try to delete each other in the LSF environment; perhaps something similar is happening here, where previous jobs on the node somehow cause this repetitive logging, but if so I can't easily see where.
Some jobs that have launched and completed on LSF will print messages like the ones below over and over. Usually it is one job spamming its end message, but I think I have an example where two different jobs do it, which might support the "land on same node" hypothesis:
selene.cbio.private 2017-06-23 06:40:13,253 Thread-4 DEBUG toil.statsAndLogging: Received Toil worker log. Disable debug level logging to hide this output
selene.cbio.private 2017-06-23 06:40:13,254 Thread-4 DEBUG toil.statsAndLogging: 'cmo_trimgalore' cmo_trimgalore C/e/jobp_P6Km ---TOIL WORKER OUTPUT LOG---
selene.cbio.private 2017-06-23 06:40:13,254 Thread-4 DEBUG toil.statsAndLogging: 'cmo_trimgalore' cmo_trimgalore C/e/jobp_P6Km DEBUG:toil.worker:Next available file descriptor: 5
selene.cbio.private 2017-06-23 06:40:13,254 Thread-4 DEBUG toil.statsAndLogging: 'cmo_trimgalore' cmo_trimgalore C/e/jobp_P6Km INFO:toil:Running Toil version 3.8.0a1-dae4ada363ae17eed60babf913e6f76e40f408e9.
selene.cbio.private 2017-06-23 06:40:13,254 Thread-4 DEBUG toil.statsAndLogging: 'cmo_trimgalore' cmo_trimgalore C/e/jobp_P6Km DEBUG:toil:Configuration: {'rescueJobsFrequency': 3600, 'logLevel': 'DEBUG', 'maxMemory': 9223372036854775807, 'jobStore': 'file:/ifs/work/chunj/prism-proto/prism/tmp/jobstore-f5084bee-5796-11e7-b776-645106efb11c', 'defaultPreemptable': False, 'disableHotDeployment': False, 'maxPreemptableServiceJobs': 9223372036854775807, 'servicePollingInterval': 60, 'workDir': '/ifs/work/chunj/prism-proto/prism/tmp', 'stats': True, 'disableCaching': True, 'sseKey': None, 'nodeOptions': None, 'environment': {}, 'minNodes': 0, 'cleanWorkDir': 'never', 'maxCores': 9223372036854775807, 'minPreemptableNodes': 0, 'maxPreemptableNodes': 0, 'maxDisk': 9223372036854775807, 'scaleInterval': 30, 'deadlockWait': 60, 'preemptableNodeType': None, 'nodeType': None, 'clusterStats': None, 'defaultCores': 1, 'parasolMaxBatches': 10000, 'cseKey': None, 'betaInertia': 1.2, 'maxNodes': 10, 'scale': 1, 'writeLogs': '/ifs/work/chunj/prism-proto/ifs/prism/inputs/charris/examples/Proj_05583_F/output/log', 'badWorker': 0.0, 'defaultDisk': 10737418240, 'mesosMasterAddress': 'localhost:5050', 'restart': False, 'useAsync': True, 'preemptableCompensation': 0.0, 'parasolCommand': 'parasol', 'workflowID': '43a666f2-faf7-4a49-8dc6-fed8b6f51705', 'alphaPacking': 0.8, 'maxServiceJobs': 9223372036854775807, 'readGlobalFileMutableByDefault': False, 'badWorkerFailInterval': 0.01, 'maxLogFileSize': 0, 'defaultMemory': 12884901888, 'preemptableNodeOptions': None, 'workflowAttemptNumber': 0, 'maxJobDuration': 9223372036854775807, 'clean': 'never', 'provisioner': None, 'batchSystem': 'lsf', 'retryCount': 0, 'writeLogsGzip': None}
selene.cbio.private 2017-06-23 06:40:13,255 Thread-4 DEBUG toil.statsAndLogging: 'cmo_trimgalore' cmo_trimgalore C/e/jobp_P6Km DEBUG:toil.worker:Parsed jobGraph
selene.cbio.private 2017-06-23 06:40:13,255 Thread-4 DEBUG toil.statsAndLogging: 'cmo_trimgalore' cmo_trimgalore C/e/jobp_P6Km DEBUG:toil.worker:Cleaned up any references to completed successor jobs
selene.cbio.private 2017-06-23 06:40:13,255 Thread-4 DEBUG toil.statsAndLogging: 'cmo_trimgalore' cmo_trimgalore C/e/jobp_P6Km INFO:toil.worker:Worker log can be found at /ifs/work/chunj/prism-proto/prism/tmp/toil-43a666f2-faf7-4a49-8dc6-fed8b6f51705/tmpePa3X6. Set --cleanWorkDir to retain this log
selene.cbio.private 2017-06-23 06:40:13,256 Thread-4 DEBUG toil.statsAndLogging: 'cmo_trimgalore' cmo_trimgalore C/e/jobp_P6Km INFO:toil.worker:Finished running the chain of jobs on this node, we ran for a total of 0.000140 seconds
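For anyone triaging a leader log with this symptom, here is a small, purely hypothetical helper (not part of Toil) that counts how many times each job's worker log gets re-emitted. It assumes the "Received Toil worker log" marker and the 'job name' / job-store ID layout shown in the excerpt above; adjust the parsing if your log format differs.

```python
# Hypothetical helper (not part of Toil): count how many times each job's
# worker log is re-emitted in a cwltoil leader log, using the
# "Received Toil worker log" marker visible in the excerpt above.
import re
import sys
from collections import Counter

MARKER = "Received Toil worker log"

def count_worker_log_dumps(log_path):
    counts = Counter()
    expect_job_line = False
    with open(log_path, errors="replace") as fh:
        for line in fh:
            if MARKER in line:
                # The next statsAndLogging line names the job, e.g.
                # "... 'cmo_trimgalore' cmo_trimgalore C/e/jobp_P6Km ..."
                expect_job_line = True
            elif expect_job_line:
                m = re.search(r"'([^']+)'\s+\S+\s+(\S+)", line)
                if m:
                    counts[(m.group(1), m.group(2))] += 1
                expect_job_line = False
    return counts

if __name__ == "__main__":
    for (name, store_id), n in count_worker_log_dumps(sys.argv[1]).most_common():
        print(f"{name} ({store_id}): worker log dumped {n} times")
```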
Here is our cwltoil run line. I can give you more details on any of these variables that matter to you.
cwltoil \
${PRISM_BIN_PATH}/pipeline/${PIPELINE_VERSION}/${WORKFLOW_FILENAME} \
${INPUT_FILENAME} \
--jobStore file://${jobstore_path} \
--defaultDisk 10G \
--defaultMem 12G \
--preserve-environment PATH PRISM_DATA_PATH PRISM_BIN_PATH PRISM_EXTRA_BIND_PATH PRISM_INPUT_PATH PRISM_OUTPUT_PATH PRISM_SINGULARITY_PATH CMO_RESOURCE_CONFIG \
--no-container \
--not-strict \
--disableCaching \
--realTimeLogging \
--maxLogFileSize 0 \
--writeLogs ${OUTPUT_DIRECTORY}/log \
--logFile ${OUTPUT_DIRECTORY}/log/cwltoil.log \
--workDir ${PRISM_BIN_PATH}/tmp \
--outdir ${OUTPUT_DIRECTORY} ${RESTART_OPTIONS} ${BATCH_SYS_OPTIONS} ${DEBUG_OPTIONS} \
| tee ${OUTPUT_DIRECTORY}/output-meta.json
Issue is synchronized with Jira story TOIL-173.
Top GitHub Comments
Once a job, or perhaps the logger, gets into this state, the same message is printed indefinitely until the workflow stops. We had the max log file size turned off and DEBUG on for a run today, and it kept returning the log from Abra, which produces a large amount of stdout, so the leader log was growing at roughly 1 megabyte per second. It would dump this Abra log, then show the "waiting for bjobs id #####" line that is the normal leader status output, then the same Abra log again.
Thanks very much, and again, I can provide any more detailed information you require, or run any command you like and return the output. Most of the logs end up being 50+ MB because of this issue, but I could ship one if you wish.
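To make the suspected failure mode concrete: if the leader's polling loop kept a finished job in its tracked set, it would re-report that job, and re-print its captured worker log, on every poll, interleaved with the normal "waiting for bjobs id" status output. The sketch below is purely illustrative; it is not Toil's actual LSF batch-system code, and the job ID, names, and log text are made up.

```python
# Illustrative sketch only -- NOT Toil's LSF code. It shows how a polling
# loop that never removes a completed job from its tracked set would re-emit
# the same "finished" event (and the captured worker log) on every cycle.
import logging
import time

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("leader-sketch")

tracked_jobs = {12345: "cmo_trimgalore"}                 # made-up LSF job ID
captured_logs = {12345: "---TOIL WORKER OUTPUT LOG---"}  # stand-in worker log

def poll_bjobs():
    """Stand-in for parsing `bjobs` output; pretend job 12345 is DONE."""
    return {12345: "DONE"}

for cycle in range(3):                                   # three polling cycles
    for job_id, state in poll_bjobs().items():
        log.debug("Waiting for bjobs id %s", job_id)
        if state == "DONE" and job_id in tracked_jobs:
            # The bug in this sketch: the job is reported but never popped
            # from tracked_jobs, so its log is printed again next cycle.
            log.debug("Job %s finished; worker log:\n%s",
                      tracked_jobs[job_id], captured_logs[job_id])
    time.sleep(1)
```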
Based on the conversation, I don't think there's anything specific to cwltoil going on here. (Sorry, I didn't mean to close the issue; I hit the wrong button.)