
Call caching consistently disabled (AWS Batch backend)


Backend: AWS Batch

Workflow: https://github.com/FredHutch/reproducible-workflows/blob/master/WDL/unpaired-panel-consensus-variants-human/frankenstein.wdl

Input file: https://github.com/FredHutch/reproducible-workflows/blob/master/WDL/unpaired-panel-consensus-variants-human/map-variantcall-hg38.json

Possibly related to #4412, but I’m not sure, as I don’t see the same error message.

When submitting a workflow via the Cromwell server, we consistently see a failure to hash some items in S3, resulting in call caching being disabled for the run. We have seen this for a number of workflows; here we are including just one.

Call caching is a hugely important feature for us, and if it is not available we would have to reconsider using Cromwell.

I think I have discussed with @ruchim the fact that all objects in S3 already have a hash computed (the ETag header), so there should not be timeouts in computing these hashes: they are available with a HEAD request (you don’t need to download the whole object).
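As a minimal sketch of that HEAD-request approach, assuming boto3 (any S3 client would do) and reusing the redacted bucket name and object key from the error message below:

import boto3

s3 = boto3.client("s3")

# A HEAD request returns object metadata, including the precomputed ETag,
# without downloading any of the object's data.
response = s3.head_object(
    Bucket="bucketname",  # redacted placeholder from the error below
    Key=(
        "cromwell-tests/Panel_BWA_GATK4_Samtools_Var_Annotate/"
        "162c863f-c22a-4b7c-bb37-f5195b329b36/call-ApplyBQSR/shard-0/"
        "smallTestData.hg38.recal.bam"
    ),
)

# Note: for multipart uploads the ETag is not a plain MD5 of the object's bytes.
print(response["ETag"])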

Error message (extract from /metadata output):

     "callCaching": {
          "hashFailures": [
            {
              "causedBy": [],
              "message": "Hashing request timed out for: s3://bucketname/cromwell-tests/Panel_BWA_GATK4_Samtools_Var_Annotate/162c863f-c22a-4b7c-bb37-f5195b329b36/call-ApplyBQSR/shard-0/smallTestData.hg38.recal.bam"
            }
          ],
          "allowResultReuse": false,
          "hit": false,
          "result": "Cache Miss",
          "effectiveCallCachingMode": "CallCachingOff"
        },
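
For reference, a sketch of pulling this callCaching block for every call from a running server’s /metadata endpoint; the host and port are assumptions, and the workflow id comes from the S3 path above:

import requests

CROMWELL = "http://localhost:8000"  # assumed server address
WORKFLOW_ID = "162c863f-c22a-4b7c-bb37-f5195b329b36"  # from the S3 path above

metadata = requests.get(
    f"{CROMWELL}/api/workflows/v1/{WORKFLOW_ID}/metadata"
).json()

# Each call can have several attempts; print the cache outcome and any
# hash-failure messages for each attempt.
for call_name, attempts in metadata.get("calls", {}).items():
    for attempt in attempts:
        caching = attempt.get("callCaching", {})
        failures = [f["message"] for f in caching.get("hashFailures", [])]
        print(call_name, caching.get("result"), failures)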

Config file:

include required(classpath("application"))


call-caching {
    enabled = true
    invalidate-bad-cache-results = true
}


database {
  # Store metadata in a file on disk that can grow much larger than RAM limits.
    profile = "slick.jdbc.HsqldbProfile$"
    db {
      driver = "org.hsqldb.jdbcDriver"
      url = "jdbc:hsqldb:file:aws-database;shutdown=false;hsqldb.tx=mvcc"
      connectionTimeout = 3000
    }
}



aws {
  application-name = "cromwell"
  auths = [
    {
      name = "default"
      scheme = "default"
    }
    {
        name = "assume-role-based-on-another"
        scheme = "assume_role"
        base-auth = "default"
        role-arn = "arn:aws:iam::xxx:role/fbucketname"
    }
  ]
  // diff 1:
  # region = "us-west-2" // uses region from ~/.aws/config set by aws configure command,
  #                    // or us-east-1 by default
}
engine {
  filesystems {
    s3 {
      auth = "assume-role-based-on-another"
    }
  }
}
backend {
  default = "AWSBATCH"
  providers {
    AWSBATCH {
      actor-factory = "cromwell.backend.impl.aws.AwsBatchBackendLifecycleActorFactory"
      config {
        // Base bucket for workflow executions
        root = "s3://bucketname/cromwell-tests"
        // A reference to an auth defined in the `aws` stanza at the top.  This auth is used to create
        // Jobs and manipulate auth JSONs.
        auth = "default"
        // diff 2:
        numSubmitAttempts = 1
        // diff 3:
        numCreateDefinitionAttempts = 1
        default-runtime-attributes {
          queueArn: "arn:aws:batch:us-west-2:xxx:job-queue/GenomicsHighPriorityQue-xxx"
        }
        filesystems {
          s3 {
            // A reference to a potentially different auth for manipulating files via engine functions.
            auth = "default"
          }
        }
      }
    }
  }
}
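
Given the assume-role auth in this config, one diagnostic worth trying outside Cromwell is to confirm that the role can issue HEAD requests against the execution bucket at all, since a slow or failing permission path could plausibly surface as a hashing timeout. A sketch with boto3; the role ARN and bucket name are the redacted placeholders from the config above:

import boto3

# Assume the same role the engine s3 filesystem is configured to use.
sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::xxx:role/fbucketname",  # redacted placeholder
    RoleSessionName="cromwell-cache-debug",
)["Credentials"]

s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)

key = (
    "cromwell-tests/Panel_BWA_GATK4_Samtools_Var_Annotate/"
    "162c863f-c22a-4b7c-bb37-f5195b329b36/call-ApplyBQSR/shard-0/"
    "smallTestData.hg38.recal.bam"
)  # the object named in the hashFailures message

# If this call is slow or fails, the problem lives in IAM/S3 configuration
# rather than in Cromwell itself.
print(s3.head_object(Bucket="bucketname", Key=key)["ETag"])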

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Reactions: 2
  • Comments: 31 (20 by maintainers)

Top GitHub Comments

1 reaction
TimurIs commented, Feb 25, 2019

When are you going to fix it? It is quite hard to work on AWS without the call caching working properly…

0 reactions
vortexing commented, Apr 5, 2019

ETA: It seems that the FIRST task did successfully pick up that it had been run before and managed to reuse its output, BUT every one of the subsequent tasks did not manage to realize that the output of that same task in the previous identical workflow already existed. So baby steps?
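
If this pattern holds across runs, Cromwell’s call-cache diff endpoint (/api/workflows/v1/callcaching/diff) is one way to see which hash components differ between the two supposedly identical runs. A sketch assuming the server address; the workflow ids are placeholders, and the call name is derived from the S3 path above:

import requests

CROMWELL = "http://localhost:8000"  # assumed server address

# workflowA/workflowB are placeholders for the two identical runs;
# indexA/indexB select shard 0 of the scattered call.
resp = requests.get(
    f"{CROMWELL}/api/workflows/v1/callcaching/diff",
    params={
        "workflowA": "<first run id>",
        "callA": "Panel_BWA_GATK4_Samtools_Var_Annotate.ApplyBQSR",
        "indexA": 0,
        "workflowB": "<second run id>",
        "callB": "Panel_BWA_GATK4_Samtools_Var_Annotate.ApplyBQSR",
        "indexB": 0,
    },
)

# The response should list the hash keys whose values differ between the
# two calls, which is what a miss on a supposedly identical task needs explained.
print(resp.json())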
