Call caching consistently disabled (AWS Batch backend)
Backend: AWS Batch
Possibly related to #4412 but not sure as I don’t see the same error message.
When submitting a workflow via the Cromwell server, we consistently see a failure to hash some items in S3, resulting in call caching being disabled for the run. We have seen this for a number of workflows; here we include just one.
Call caching is a hugely important feature for us, and if it is not available we may have to reconsider using Cromwell.
I think I have discussed with @ruchim the fact that every object in S3 already has a hash computed (the ETag header), so there should be no timeouts when computing these hashes: the ETag is available via a HEAD request, without downloading the object.
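As a minimal sketch of that point (assuming boto3, with a placeholder bucket and key rather than the real paths from this run; note the ETag is only a plain MD5 for single-part uploads):

import boto3  # assumption: AWS credentials/region come from the standard config chain

s3 = boto3.client("s3")

# head_object issues an HTTP HEAD request: S3 returns only metadata, including
# the precomputed ETag, and never the object body. Bucket and key below are
# placeholders, not the actual paths from this issue.
resp = s3.head_object(
    Bucket="bucketname",
    Key="cromwell-tests/example/smallTestData.hg38.recal.bam",
)
print(resp["ETag"])  # e.g. '"9bb58f26192e4ba00f01e2e7b136bbd8"' for a single-part upload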
Error message (extract from the /metadata output):
"callCaching": {
"hashFailures": [
{
"causedBy": [],
"message": "Hashing request timed out for: s3://bucketname/cromwell-tests/Panel_BWA_GATK4_Samtools_Var_Annotate/162c863f-c22a-4b7c-bb37-f5195b329b36/call-ApplyBQSR/shard-0/smallTestData.hg38.recal.bam"
}
],
"allowResultReuse": false,
"hit": false,
"result": "Cache Miss",
"effectiveCallCachingMode": "CallCachingOff"
},
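To find every failure like this across a run, one can pull the full metadata and filter for hashFailures. A minimal sketch, assuming the requests library and a hypothetical Cromwell server on localhost:8000 (the metadata endpoint itself is standard Cromwell API):

import requests

CROMWELL = "http://localhost:8000"  # hypothetical host/port; adjust to your server
WORKFLOW_ID = "162c863f-c22a-4b7c-bb37-f5195b329b36"  # the run from this issue

# GET /api/workflows/v1/{id}/metadata returns the same document the extract
# above was taken from.
meta = requests.get(f"{CROMWELL}/api/workflows/v1/{WORKFLOW_ID}/metadata").json()

# "calls" maps each call name to a list of attempts (shards/retries); each
# attempt may carry a callCaching block like the one shown above.
for call_name, attempts in meta.get("calls", {}).items():
    for attempt in attempts:
        for failure in attempt.get("callCaching", {}).get("hashFailures", []):
            print(call_name, failure.get("message"))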
Config file:
include required(classpath("application"))

call-caching {
  enabled = true
  invalidate-bad-cache-results = true
}

database {
  # Store metadata in a file on disk that can grow much larger than RAM limits.
  profile = "slick.jdbc.HsqldbProfile$"
  db {
    driver = "org.hsqldb.jdbcDriver"
    url = "jdbc:hsqldb:file:aws-database;shutdown=false;hsqldb.tx=mvcc"
    connectionTimeout = 3000
  }
}

aws {
  application-name = "cromwell"
  auths = [
    {
      name = "default"
      scheme = "default"
    },
    {
      name = "assume-role-based-on-another"
      scheme = "assume_role"
      base-auth = "default"
      role-arn = "arn:aws:iam::xxx:role/fbucketname"
    }
  ]
  // diff 1:
  # region = "us-west-2"  // uses region from ~/.aws/config set by aws configure command,
  #                       // or us-east-1 by default
}

engine {
  filesystems {
    s3 {
      auth = "assume-role-based-on-another"
    }
  }
}

backend {
  default = "AWSBATCH"
  providers {
    AWSBATCH {
      actor-factory = "cromwell.backend.impl.aws.AwsBatchBackendLifecycleActorFactory"
      config {
        // Base bucket for workflow executions
        root = "s3://bucketname/cromwell-tests"
        // A reference to an auth defined in the `aws` stanza at the top. This auth is used to create
        // Jobs and manipulate auth JSONs.
        auth = "default"
        // diff 2:
        numSubmitAttempts = 1
        // diff 3:
        numCreateDefinitionAttempts = 1
        default-runtime-attributes {
          queueArn: "arn:aws:batch:us-west-2:xxx:job-queue/GenomicsHighPriorityQue-xxx"
        }
        filesystems {
          s3 {
            // A reference to a potentially different auth for manipulating files via engine functions.
            auth = "default"
          }
        }
      }
    }
  }
}
Top GitHub Comments
When are you going to fix it? It is quite hard to work on AWS without call caching working properly…
ETA: It seems that the FIRST task did successfully detect that it had been run before and managed to reuse its output, BUT every one of the subsequent tasks failed to recognize that the output of the same task in the previous, identical workflow already existed. So, baby steps?