cwltoil managing LSF will occasionally print dead jobs logging info repetitively in DEBUG mode
I wrote about this a bit on Gitter. This is non-deterministic. I recall another recent issue where jobs landing on the same node would try to delete each other in the LSF environment; perhaps something similar is happening here, where previous jobs on the node somehow cause this repetitive logging, but if so I can't easily see where.
Some jobs that have launched and completed on LSF will print messages like the ones below over and over. Usually it is one job spamming its end message, but I think I have an example where two different jobs do it, which might support the "land on same node" hypothesis:
selene.cbio.private 2017-06-23 06:40:13,253 Thread-4 DEBUG toil.statsAndLogging: Received Toil worker log. Disable debug level logging to hide this output
selene.cbio.private 2017-06-23 06:40:13,254 Thread-4 DEBUG toil.statsAndLogging: 'cmo_trimgalore' cmo_trimgalore C/e/jobp_P6Km ---TOIL WORKER OUTPUT LOG---
selene.cbio.private 2017-06-23 06:40:13,254 Thread-4 DEBUG toil.statsAndLogging: 'cmo_trimgalore' cmo_trimgalore C/e/jobp_P6Km DEBUG:toil.worker:Next available file descriptor: 5
selene.cbio.private 2017-06-23 06:40:13,254 Thread-4 DEBUG toil.statsAndLogging: 'cmo_trimgalore' cmo_trimgalore C/e/jobp_P6Km INFO:toil:Running Toil version 3.8.0a1-dae4ada363ae17eed60babf913e6f76e40f408e9.
selene.cbio.private 2017-06-23 06:40:13,254 Thread-4 DEBUG toil.statsAndLogging: 'cmo_trimgalore' cmo_trimgalore C/e/jobp_P6Km DEBUG:toil:Configuration: {'rescueJobsFrequency': 3600, 'logLevel': 'DEBUG', 'maxMemory': 9223372036854775807, 'jobStore': 'file:/ifs/work/chunj/prism-proto/prism/tmp/jobstore-f5084bee-5796-11e7-b776-645106efb11c', 'defaultPreemptable': False, 'disableHotDeployment': False, 'maxPreemptableServiceJobs': 9223372036854775807, 'servicePollingInterval': 60, 'workDir': '/ifs/work/chunj/prism-proto/prism/tmp', 'stats': True, 'disableCaching': True, 'sseKey': None, 'nodeOptions': None, 'environment': {}, 'minNodes': 0, 'cleanWorkDir': 'never', 'maxCores': 9223372036854775807, 'minPreemptableNodes': 0, 'maxPreemptableNodes': 0, 'maxDisk': 9223372036854775807, 'scaleInterval': 30, 'deadlockWait': 60, 'preemptableNodeType': None, 'nodeType': None, 'clusterStats': None, 'defaultCores': 1, 'parasolMaxBatches': 10000, 'cseKey': None, 'betaInertia': 1.2, 'maxNodes': 10, 'scale': 1, 'writeLogs': '/ifs/work/chunj/prism-proto/ifs/prism/inputs/charris/examples/Proj_05583_F/output/log', 'badWorker': 0.0, 'defaultDisk': 10737418240, 'mesosMasterAddress': 'localhost:5050', 'restart': False, 'useAsync': True, 'preemptableCompensation': 0.0, 'parasolCommand': 'parasol', 'workflowID': '43a666f2-faf7-4a49-8dc6-fed8b6f51705', 'alphaPacking': 0.8, 'maxServiceJobs': 9223372036854775807, 'readGlobalFileMutableByDefault': False, 'badWorkerFailInterval': 0.01, 'maxLogFileSize': 0, 'defaultMemory': 12884901888, 'preemptableNodeOptions': None, 'workflowAttemptNumber': 0, 'maxJobDuration': 9223372036854775807, 'clean': 'never', 'provisioner': None, 'batchSystem': 'lsf', 'retryCount': 0, 'writeLogsGzip': None}
selene.cbio.private 2017-06-23 06:40:13,255 Thread-4 DEBUG toil.statsAndLogging: 'cmo_trimgalore' cmo_trimgalore C/e/jobp_P6Km DEBUG:toil.worker:Parsed jobGraph
selene.cbio.private 2017-06-23 06:40:13,255 Thread-4 DEBUG toil.statsAndLogging: 'cmo_trimgalore' cmo_trimgalore C/e/jobp_P6Km DEBUG:toil.worker:Cleaned up any references to completed successor jobs
selene.cbio.private 2017-06-23 06:40:13,255 Thread-4 DEBUG toil.statsAndLogging: 'cmo_trimgalore' cmo_trimgalore C/e/jobp_P6Km INFO:toil.worker:Worker log can be found at /ifs/work/chunj/prism-proto/prism/tmp/toil-43a666f2-faf7-4a49-8dc6-fed8b6f51705/tmpePa3X6. Set --cleanWorkDir to retain this log
selene.cbio.private 2017-06-23 06:40:13,256 Thread-4 DEBUG toil.statsAndLogging: 'cmo_trimgalore' cmo_trimgalore C/e/jobp_P6Km INFO:toil.worker:Finished running the chain of jobs on this node, we ran for a total of 0.000140 seconds
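For anyone triaging a leader log with this symptom, here is a small, purely hypothetical helper (not part of Toil) that counts how many times each job's worker log gets re-emitted. It assumes the "Received Toil worker log" marker and the 'job name' / job-store ID layout shown in the excerpt above; adjust the parsing if your log format differs.

```python
# Hypothetical helper (not part of Toil): count how many times each job's
# worker log is re-emitted in a cwltoil leader log, using the
# "Received Toil worker log" marker visible in the excerpt above.
import re
import sys
from collections import Counter

MARKER = "Received Toil worker log"

def count_worker_log_dumps(log_path):
    counts = Counter()
    expect_job_line = False
    with open(log_path, errors="replace") as fh:
        for line in fh:
            if MARKER in line:
                # The next statsAndLogging line names the job, e.g.
                # "... 'cmo_trimgalore' cmo_trimgalore C/e/jobp_P6Km ..."
                expect_job_line = True
            elif expect_job_line:
                m = re.search(r"'([^']+)'\s+\S+\s+(\S+)", line)
                if m:
                    counts[(m.group(1), m.group(2))] += 1
                expect_job_line = False
    return counts

if __name__ == "__main__":
    for (name, store_id), n in count_worker_log_dumps(sys.argv[1]).most_common():
        print(f"{name} ({store_id}): worker log dumped {n} times")
```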
Here is our cwltoil run line. I can give you more details on any of these variables that matter to you.
cwltoil \
${PRISM_BIN_PATH}/pipeline/${PIPELINE_VERSION}/${WORKFLOW_FILENAME} \
${INPUT_FILENAME} \
--jobStore file://${jobstore_path} \
--defaultDisk 10G \
--defaultMem 12G \
--preserve-environment PATH PRISM_DATA_PATH PRISM_BIN_PATH PRISM_EXTRA_BIND_PATH PRISM_INPUT_PATH PRISM_OUTPUT_PATH PRISM_SINGULARITY_PATH CMO_RESOURCE_CONFIG \
--no-container \
--not-strict \
--disableCaching \
--realTimeLogging \
--maxLogFileSize 0 \
--writeLogs ${OUTPUT_DIRECTORY}/log \
--logFile ${OUTPUT_DIRECTORY}/log/cwltoil.log \
--workDir ${PRISM_BIN_PATH}/tmp \
--outdir ${OUTPUT_DIRECTORY} ${RESTART_OPTIONS} ${BATCH_SYS_OPTIONS} ${DEBUG_OPTIONS} \
| tee ${OUTPUT_DIRECTORY}/output-meta.json
Issue is synchronized with Jira story TOIL-173.
Top GitHub Comments
Once a job, or perhaps the logger, gets into this state, the same message is printed indefinitely until the workflow stops. We had the max log file size turned off and DEBUG on for a run today, and it kept returning the log from Abra, which produces a large amount of stdout, so the leader log was growing at roughly 1 megabyte per second. It would dump this Abra log, then show the "waiting for bjobs id #####" line that is the normal leader status output, then the same Abra log again.
Thanks very much, and again, I can provide any more detailed information you require, or run any command you like and return the output. Most of the logs end up being 50+ MB because of this issue, but I could ship one if you wish.
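To make the suspected failure mode concrete: if the leader's polling loop kept a finished job in its tracked set, it would re-report that job, and re-print its captured worker log, on every poll, interleaved with the normal "waiting for bjobs id" status output. The sketch below is purely illustrative; it is not Toil's actual LSF batch-system code, and the job ID, names, and log text are made up.

```python
# Illustrative sketch only -- NOT Toil's LSF code. It shows how a polling
# loop that never removes a completed job from its tracked set would re-emit
# the same "finished" event (and the captured worker log) on every cycle.
import logging
import time

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("leader-sketch")

tracked_jobs = {12345: "cmo_trimgalore"}                 # made-up LSF job ID
captured_logs = {12345: "---TOIL WORKER OUTPUT LOG---"}  # stand-in worker log

def poll_bjobs():
    """Stand-in for parsing `bjobs` output; pretend job 12345 is DONE."""
    return {12345: "DONE"}

for cycle in range(3):                                   # three polling cycles
    for job_id, state in poll_bjobs().items():
        log.debug("Waiting for bjobs id %s", job_id)
        if state == "DONE" and job_id in tracked_jobs:
            # The bug in this sketch: the job is reported but never popped
            # from tracked_jobs, so its log is printed again next cycle.
            log.debug("Job %s finished; worker log:\n%s",
                      tracked_jobs[job_id], captured_logs[job_id])
    time.sleep(1)
```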
Based on the conversation, I don't think there's anything specific to cwltoil going on here. (Sorry, I didn't mean to close the issue; I hit the wrong button.)