
Caper fails to launch pipeline

See original GitHub issue

An exception is thrown on pipeline launch; it seems to be a Caper problem (I get the same output when running Caper without an input JSON). Running on a SLURM cluster with pipeline v1.8.0.

Caper configuration file

backend=slurm

# define one of the following (or both) according to your
# cluster's SLURM configuration.
slurm-partition=normal
slurm-account=

# Hashing strategy for call-caching (3 choices)
# This parameter is for local (local/slurm/sge/pbs) backend only.
# This is important for call-caching,
# which means re-using outputs from previous/failed workflows.
# Cache will miss if different strategy is used.
# "file" method has been default for all old versions of Caper<1.0.
# "path+modtime" is a new default for Caper>=1.0,
#   file: use md5sum hash (slow).
#   path: use path.
#   path+modtime: use path and modification time.
local-hash-strat=path+modtime

# Local directory for localized files and Cromwell's intermediate files
# If not defined, Caper will make .caper_tmp/ on local-out-dir or CWD.
# /tmp is not recommended here since Caper stores all localized data files
# in this directory (e.g. input FASTQs defined as URLs in the input JSON).
local-loc-dir=/home/kdemuren/data/tmp-caper/

cromwell=/home/kdemuren/.caper/cromwell_jar/cromwell-52.jar
womtool=/home/kdemuren/.caper/womtool_jar/womtool-52.jar
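One thing worth double-checking in this configuration is that slurm-account= is present but left empty; depending on the cluster and Caper version, a blank value here may or may not matter, but it is easy to flag. As a quick sanity check only (this is not part of Caper; find_empty_keys is a hypothetical helper), here is a minimal Python sketch that reports keys left empty in a key=value style conf file like the one above:

```python
# Minimal sketch (not part of Caper): flag conf keys that are present
# but have no value, e.g. "slurm-account=". Assumes the simple
# "key=value" format shown above, with "#" comment lines.

def find_empty_keys(conf_text):
    """Return keys whose value is empty, ignoring comments and blanks."""
    empty = []
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "=" in line:
            key, _, value = line.partition("=")
            if not value.strip():
                empty.append(key.strip())
    return empty

if __name__ == "__main__":
    conf = """\
backend=slurm
slurm-partition=normal
slurm-account=
"""
    print(find_empty_keys(conf))  # ['slurm-account']
```

Either filling in the account or deleting the empty line keeps the conf unambiguous.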

Stdout/error from pipeline run

2020-08-22 18:05:03,673|caper.caper_base|INFO| Creating a timestamped temporary directory. /home/kdemuren/data/tmp-caper/atac/20200822_180503_672685
2020-08-22 18:05:03,673|caper.caper_runner|INFO| Localizing files on work_dir. /home/kdemuren/data/tmp-caper/atac/20200822_180503_672685
2020-08-22 18:05:05,158|caper.cromwell|INFO| Validating WDL/inputs/imports with Womtool...
2020-08-22 18:05:10,364|caper.cromwell|INFO| Womtool validation passed.
2020-08-22 18:05:10,365|caper.caper_runner|INFO| launching run: wdl=/net/bmc-pub14/data/boyer/users/kdemuren/atac-seq-pipeline/atac.wdl, inputs=/net/bmc-pub14/data/boyer/users/kdemuren/CM_ATAC_DMSO.json, backend_conf=/home/kdemuren/data/tmp-caper/atac/20200822_180503_672685/backend.conf
2020-08-22 18:05:27,113|caper.cromwell_workflow_monitor|INFO| Workflow: id=4238d010-93c3-40f2-b315-ab71b795acfc, status=Submitted
2020-08-22 18:05:27,175|caper.cromwell_workflow_monitor|INFO| Workflow: id=4238d010-93c3-40f2-b315-ab71b795acfc, status=Running
2020-08-22 18:05:30,631|caper.cromwell_workflow_monitor|INFO| Workflow: id=4238d010-93c3-40f2-b315-ab71b795acfc, status=Failed
2020-08-22 18:05:37,153|caper.cromwell_metadata|WARNING| Failed to write metadata file. workflowRoot not found. wf_id=4238d010-93c3-40f2-b315-ab71b795acfc
2020-08-22 18:05:37,153|caper.cromwell|INFO| Workflow failed. Auto-troubleshooting...
2020-08-22 18:05:37,153|caper.nb_subproc_thread|ERROR| Subprocess failed. returncode=1
2020-08-22 18:05:37,153|caper.cli|ERROR| Check stdout/stderr in /net/bmc-pub14/data/boyer/users/kdemuren/output-atac/CM_DMSO/cromwell.out
* Started troubleshooting workflow: id=4238d010-93c3-40f2-b315-ab71b795acfc, status=Failed
* Found failures JSON object.
[
    {
        "causedBy": [
            {
                "message": "Task raise_exception has an invalid runtime attribute memory = !! NOT FOUND !!",
                "causedBy": []
            }
        ],
        "message": "Runtime validation failed"
    }
]
* Recursively finding failures in calls (tasks)...
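The "Recursively finding failures" step refers to walking Cromwell's nested failures objects, where each entry can carry its own causedBy list of deeper causes. As an illustration only (this is not Caper's actual implementation), the nesting can be flattened to the leaf error messages like this:

```python
import json

# Illustrative sketch (not Caper's actual code): flatten Cromwell's
# nested failures JSON into its leaf error messages by following
# each entry's "causedBy" chain to the bottom.

def leaf_messages(failures):
    """Recursively collect messages from entries with no deeper cause."""
    messages = []
    for f in failures:
        caused_by = f.get("causedBy", [])
        if caused_by:
            messages.extend(leaf_messages(caused_by))
        else:
            messages.append(f.get("message", ""))
    return messages

failures = json.loads("""
[{"causedBy": [{"message": "Task raise_exception has an invalid runtime attribute memory = !! NOT FOUND !!",
                "causedBy": []}],
  "message": "Runtime validation failed"}]
""")
print(leaf_messages(failures))
```

Applied to the failures object above, this yields the single leaf message about the unresolved memory runtime attribute, which is the actual error to chase.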

cromwell.out

2020-08-22 18:05:12,893  INFO  - Running with database db.url = jdbc:hsqldb:mem:ceaa239f-a238-454e-b534-c2601dc92129;shutdown=false;hsqldb.tx=mvcc
2020-08-22 18:05:25,396  INFO  - Running migration RenameWorkflowOptionsInMetadata with a read batch size of 100000 and a write batch size of 100000
2020-08-22 18:05:25,422  INFO  - [RenameWorkflowOptionsInMetadata] 100%
2020-08-22 18:05:25,583  INFO  - Running with database db.url = jdbc:hsqldb:mem:aea05e05-7f06-4502-bd4c-8e5edebef296;shutdown=false;hsqldb.tx=mvcc
2020-08-22 18:05:26,180  INFO  - Slf4jLogger started
2020-08-22 18:05:26,424 cromwell-system-akka.dispatchers.engine-dispatcher-5 INFO  - Workflow heartbeat configuration:
{
  "cromwellId" : "cromid-28ec5ec",
  "heartbeatInterval" : "2 minutes",
  "ttl" : "10 minutes",
  "failureShutdownDuration" : "5 minutes",
  "writeBatchSize" : 10000,
  "writeThreshold" : 10000
}
2020-08-22 18:05:26,475 cromwell-system-akka.dispatchers.service-dispatcher-12 INFO  - Metadata summary refreshing every 1 second.
2020-08-22 18:05:26,504 cromwell-system-akka.dispatchers.service-dispatcher-8 INFO  - WriteMetadataActor configured to flush with batch size 200 and process rate 5 seconds.
2020-08-22 18:05:26,508 cromwell-system-akka.actor.default-dispatcher-4 INFO  - KvWriteActor configured to flush with batch size 200 and process rate 5 seconds.
2020-08-22 18:05:26,526 cromwell-system-akka.dispatchers.engine-dispatcher-32 INFO  - CallCacheWriteActor configured to flush with batch size 100 and process rate 3 seconds.
2020-08-22 18:05:26,527  WARN  - 'docker.hash-lookup.gcr-api-queries-per-100-seconds' is being deprecated, use 'docker.hash-lookup.gcr.throttle' instead (see reference.conf)
2020-08-22 18:05:27,000 cromwell-system-akka.dispatchers.engine-dispatcher-32 INFO  - JobExecutionTokenDispenser - Distribution rate: 1 per 2 seconds.
2020-08-22 18:05:27,030 cromwell-system-akka.dispatchers.engine-dispatcher-5 INFO  - SingleWorkflowRunnerActor: Version 52
2020-08-22 18:05:27,043 cromwell-system-akka.dispatchers.engine-dispatcher-5 INFO  - SingleWorkflowRunnerActor: Submitting workflow
2020-08-22 18:05:27,103 cromwell-system-akka.dispatchers.api-dispatcher-36 INFO  - Unspecified type (Unspecified version) workflow 4238d010-93c3-40f2-b315-ab71b795acfc submitted
2020-08-22 18:05:27,131 cromwell-system-akka.dispatchers.engine-dispatcher-32 INFO  - SingleWorkflowRunnerActor: Workflow submitted UUID(4238d010-93c3-40f2-b315-ab71b795acfc)
2020-08-22 18:05:27,144 cromwell-system-akka.dispatchers.engine-dispatcher-33 INFO  - 1 new workflows fetched by cromid-28ec5ec: 4238d010-93c3-40f2-b315-ab71b795acfc
2020-08-22 18:05:27,160 cromwell-system-akka.dispatchers.engine-dispatcher-32 INFO  - WorkflowManagerActor Starting workflow UUID(4238d010-93c3-40f2-b315-ab71b795acfc)
2020-08-22 18:05:27,169 cromwell-system-akka.dispatchers.engine-dispatcher-32 INFO  - WorkflowManagerActor Successfully started WorkflowActor-4238d010-93c3-40f2-b315-ab71b795acfc
2020-08-22 18:05:27,169 cromwell-system-akka.dispatchers.engine-dispatcher-32 INFO  - Retrieved 1 workflows from the WorkflowStoreActor
2020-08-22 18:05:27,202 cromwell-system-akka.dispatchers.engine-dispatcher-33 INFO  - WorkflowStoreHeartbeatWriteActor configured to flush with batch size 10000 and process rate 2 minutes.
2020-08-22 18:05:27,329 cromwell-system-akka.dispatchers.engine-dispatcher-32 INFO  - MaterializeWorkflowDescriptorActor [UUID(4238d010)]: Parsing workflow as WDL 1.0
2020-08-22 18:05:30,315 cromwell-system-akka.dispatchers.engine-dispatcher-32 INFO  - MaterializeWorkflowDescriptorActor [UUID(4238d010)]: Call-to-Backend assignments: atac.reproducibility_idr -> slurm, atac.fraglen_stat_pe -> slurm, atac.macs2_signal_track -> slurm, atac.idr_pr -> slurm, atac.count_signal_track -> slurm, atac.tss_enrich -> slurm, atac.pool_blacklist -> slurm, atac.idr -> slurm, atac.xcor -> slurm, atac.compare_signal_to_roadmap -> slurm, atac.filter_no_dedup -> slurm, atac.overlap -> slurm, atac.read_genome_tsv -> slurm, atac.frac_mito -> slurm, atac.reproducibility_overlap -> slurm, atac.bam2ta -> slurm, atac.idr_ppr -> slurm, atac.macs2_signal_track_pooled -> slurm, atac.gc_bias -> slurm, atac.error_input_data -> slurm, atac.preseq -> slurm, atac.filter -> slurm, atac.call_peak_ppr1 -> slurm, atac.call_peak_pr1 -> slurm, atac.pool_ta_pr2 -> slurm, atac.count_signal_track_pooled -> slurm, atac.call_peak_pooled -> slurm, atac.align_mito -> slurm, atac.align -> slurm, atac.call_peak_ppr2 -> slurm, atac.jsd -> slurm, atac.call_peak -> slurm, atac.qc_report -> slurm, atac.call_peak_pr2 -> slurm, atac.bam2ta_no_dedup -> slurm, atac.pool_ta_pr1 -> slurm, atac.pool_ta -> slurm, atac.overlap_pr -> slurm, atac.spr -> slurm, atac.annot_enrich -> slurm, atac.overlap_ppr -> slurm
2020-08-22 18:05:30,627 cromwell-system-akka.dispatchers.engine-dispatcher-32 INFO  - WorkflowManagerActor Workflow 4238d010-93c3-40f2-b315-ab71b795acfc failed (during InitializingWorkflowState): Task raise_exception has an invalid runtime attribute memory = !! NOT FOUND !!
2020-08-22 18:05:30,631 cromwell-system-akka.dispatchers.engine-dispatcher-32 INFO  - WorkflowManagerActor WorkflowActor-4238d010-93c3-40f2-b315-ab71b795acfc is in a terminal state: WorkflowFailedState
2020-08-22 18:05:32,009 cromwell-system-akka.dispatchers.engine-dispatcher-6 INFO  - Not triggering log of token queue status. Effective log interval = None
2020-08-22 18:05:32,521 cromwell-system-akka.dispatchers.engine-dispatcher-6 INFO  - SingleWorkflowRunnerActor workflow finished with status 'Failed'.
2020-08-22 18:05:36,645 cromwell-system-akka.dispatchers.engine-dispatcher-68 INFO  - SingleWorkflowRunnerActor writing metadata to /home/kdemuren/data/tmp-caper/atac/20200822_180503_672685/metadata.json
2020-08-22 18:05:36,676  INFO  - Workflow polling stopped
2020-08-22 18:05:36,689  INFO  - 0 workflows released by cromid-28ec5ec
2020-08-22 18:05:36,693  INFO  - Shutting down WorkflowStoreActor - Timeout = 5 seconds
2020-08-22 18:05:36,696  INFO  - Shutting down WorkflowLogCopyRouter - Timeout = 5 seconds
2020-08-22 18:05:36,698  INFO  - Shutting down JobExecutionTokenDispenser - Timeout = 5 seconds
2020-08-22 18:05:36,699 cromwell-system-akka.dispatchers.engine-dispatcher-60 INFO  - Aborting all running workflows.
2020-08-22 18:05:36,699  INFO  - JobExecutionTokenDispenser stopped
2020-08-22 18:05:36,700  INFO  - WorkflowStoreActor stopped
2020-08-22 18:05:36,705  INFO  - WorkflowLogCopyRouter stopped
2020-08-22 18:05:36,705  INFO  - Shutting down WorkflowManagerActor - Timeout = 3600 seconds
2020-08-22 18:05:36,705 cromwell-system-akka.dispatchers.engine-dispatcher-68 INFO  - WorkflowManagerActor All workflows finished
2020-08-22 18:05:36,705  INFO  - WorkflowManagerActor stopped
2020-08-22 18:05:37,015  INFO  - Connection pools shut down
2020-08-22 18:05:37,017  INFO  - Shutting down SubWorkflowStoreActor - Timeout = 1800 seconds
2020-08-22 18:05:37,017  INFO  - Shutting down JobStoreActor - Timeout = 1800 seconds
2020-08-22 18:05:37,017  INFO  - Shutting down CallCacheWriteActor - Timeout = 1800 seconds
2020-08-22 18:05:37,017  INFO  - Shutting down ServiceRegistryActor - Timeout = 1800 seconds
2020-08-22 18:05:37,017  INFO  - Shutting down DockerHashActor - Timeout = 1800 seconds
2020-08-22 18:05:37,017  INFO  - Shutting down IoProxy - Timeout = 1800 seconds
2020-08-22 18:05:37,018  INFO  - JobStoreActor stopped
2020-08-22 18:05:37,018  INFO  - CallCacheWriteActor Shutting down: 0 queued messages to process
2020-08-22 18:05:37,018  INFO  - SubWorkflowStoreActor stopped
2020-08-22 18:05:37,018  INFO  - CallCacheWriteActor stopped
2020-08-22 18:05:37,018  INFO  - WriteMetadataActor Shutting down: 0 queued messages to process
2020-08-22 18:05:37,019  INFO  - KvWriteActor Shutting down: 0 queued messages to process
2020-08-22 18:05:37,021  INFO  - IoProxy stopped

Error log

Caper automatically runs a troubleshooter for failed workflows. If it doesn't, get the WORKFLOW_ID of your failed workflow with caper list, or directly use a metadata.json file in Caper's output directory.

2020-08-22 18:19:23,146|caper.server_heartbeat|ERROR| Failed to read from a heartbeat file. ~/.caper/default_server_heartbeat
Traceback (most recent call last):
  File "/home/kdemuren/miniconda3/envs/encode-atac-seq-pipeline/bin/caper", line 13, in <module>
    main()
  File "/home/kdemuren/miniconda3/envs/encode-atac-seq-pipeline/lib/python3.7/site-packages/caper/cli.py", line 504, in main
    client(parsed_args)
  File "/home/kdemuren/miniconda3/envs/encode-atac-seq-pipeline/lib/python3.7/site-packages/caper/cli.py", line 269, in client
    subcmd_troubleshoot(c, args)
  File "/home/kdemuren/miniconda3/envs/encode-atac-seq-pipeline/lib/python3.7/site-packages/caper/cli.py", line 454, in subcmd_troubleshoot
    wf_ids_or_labels=args.wf_id_or_label, embed_subworkflow=True
  File "/home/kdemuren/miniconda3/envs/encode-atac-seq-pipeline/lib/python3.7/site-packages/caper/caper_client.py", line 129, in metadata
    embed_subworkflow=embed_subworkflow,
  File "/home/kdemuren/miniconda3/envs/encode-atac-seq-pipeline/lib/python3.7/site-packages/caper/cromwell_rest_api.py", line 144, in get_metadata
    workflows = self.find(workflow_ids, labels)
  File "/home/kdemuren/miniconda3/envs/encode-atac-seq-pipeline/lib/python3.7/site-packages/caper/cromwell_rest_api.py", line 226, in find
    CromwellRestAPI.ENDPOINT_WORKFLOWS, params=CromwellRestAPI.PARAMS_WORKFLOWS
  File "/home/kdemuren/miniconda3/envs/encode-atac-seq-pipeline/lib/python3.7/site-packages/caper/cromwell_rest_api.py", line 299, in __request_get
    ) from None
Exception: Failed to connect to Cromwell server. req=GET, url=http://localhost:8000/api/workflows/v1/query
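This final exception is a separate, secondary problem: the troubleshoot subcommand ran in client mode, so it tried to query a Cromwell server at http://localhost:8000 (the URL in the traceback) and found nothing listening, since the failed run above used run mode, which starts no server. A small sketch, assuming only the default host/port shown in the error, of checking for a listening server before using client subcommands:

```python
import socket

# Quick check (illustrative): is anything listening on the Cromwell
# server port before running Caper client subcommands? The host/port
# defaults below match the URL in the traceback (http://localhost:8000).

def server_listening(host="localhost", port=8000, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    if not server_listening():
        print("No Cromwell server on localhost:8000; "
              "run caper in server mode or pass a metadata.json instead.")
```

If no server is running, pointing the troubleshooter at the run's metadata.json (as the note above suggests) avoids the REST call entirely.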

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 10 (4 by maintainers)

Top GitHub Comments

1 reaction
leepc12 commented, Sep 17, 2020

-lnodes=1:ppn= will be applied to Caper. https://github.com/ENCODE-DCC/caper/pull/91

0 reactions
atreeneedsaforest commented, Aug 31, 2020

Dear @leepc12, just an update so that you can close the issue for me as well: it worked! Apparently there is a problem in Torque which, for some mysterious reason, leaves the pipeline jobs waiting in the queue even when resources are available, but after some time (or after asking the admin to force them through) they end up running, and the whole analysis concluded smoothly. So the custom backend was successfully processed in the end. I have now launched the pipeline on my own samples with fingers crossed. Thank you very much for the help!
