Call caching consistently disabled (AWS Batch backend)
Backend: AWS Batch
Possibly related to #4412 but not sure as I don’t see the same error message.
When submitting a workflow via the Cromwell server, we consistently see a failure to hash some items in S3, resulting in call caching being disabled for the run. We have seen this for a number of workflows; here we include just one.
Call caching is a hugely important feature for us, and if it is not available we may have to reconsider using Cromwell.
I think I have discussed with @ruchim the fact that every object in S3 already has a hash computed (the ETag header), so there should be no timeouts when computing these hashes: the ETag is available via a HEAD request, without downloading the object.
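As a minimal sketch of that point (assuming boto3, with a placeholder bucket and key rather than the real paths from this run; note the ETag is only a plain MD5 for single-part uploads):

import boto3  # assumption: AWS credentials/region come from the standard config chain

s3 = boto3.client("s3")

# head_object issues an HTTP HEAD request: S3 returns only metadata, including
# the precomputed ETag, and never the object body. Bucket and key below are
# placeholders, not the actual paths from this issue.
resp = s3.head_object(
    Bucket="bucketname",
    Key="cromwell-tests/example/smallTestData.hg38.recal.bam",
)
print(resp["ETag"])  # e.g. '"9bb58f26192e4ba00f01e2e7b136bbd8"' for a single-part upload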
Error message (extract from the /metadata output):
"callCaching": {
"hashFailures": [
{
"causedBy": [],
"message": "Hashing request timed out for: s3://bucketname/cromwell-tests/Panel_BWA_GATK4_Samtools_Var_Annotate/162c863f-c22a-4b7c-bb37-f5195b329b36/call-ApplyBQSR/shard-0/smallTestData.hg38.recal.bam"
}
],
"allowResultReuse": false,
"hit": false,
"result": "Cache Miss",
"effectiveCallCachingMode": "CallCachingOff"
},
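To find every failure like this across a run, one can pull the full metadata and filter for hashFailures. A minimal sketch, assuming the requests library and a hypothetical Cromwell server on localhost:8000 (the metadata endpoint itself is standard Cromwell API):

import requests

CROMWELL = "http://localhost:8000"  # hypothetical host/port; adjust to your server
WORKFLOW_ID = "162c863f-c22a-4b7c-bb37-f5195b329b36"  # the run from this issue

# GET /api/workflows/v1/{id}/metadata returns the same document the extract
# above was taken from.
meta = requests.get(f"{CROMWELL}/api/workflows/v1/{WORKFLOW_ID}/metadata").json()

# "calls" maps each call name to a list of attempts (shards/retries); each
# attempt may carry a callCaching block like the one shown above.
for call_name, attempts in meta.get("calls", {}).items():
    for attempt in attempts:
        for failure in attempt.get("callCaching", {}).get("hashFailures", []):
            print(call_name, failure.get("message"))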
Config file:
include required(classpath("application"))

call-caching {
  enabled = true
  invalidate-bad-cache-results = true
}

database {
  # Store metadata in a file on disk that can grow much larger than RAM limits.
  profile = "slick.jdbc.HsqldbProfile$"
  db {
    driver = "org.hsqldb.jdbcDriver"
    url = "jdbc:hsqldb:file:aws-database;shutdown=false;hsqldb.tx=mvcc"
    connectionTimeout = 3000
  }
}

aws {
  application-name = "cromwell"
  auths = [
    {
      name = "default"
      scheme = "default"
    },
    {
      name = "assume-role-based-on-another"
      scheme = "assume_role"
      base-auth = "default"
      role-arn = "arn:aws:iam::xxx:role/fbucketname"
    }
  ]
  // diff 1:
  # region = "us-west-2"  // uses region from ~/.aws/config set by aws configure command,
  #                       // or us-east-1 by default
}

engine {
  filesystems {
    s3 {
      auth = "assume-role-based-on-another"
    }
  }
}

backend {
  default = "AWSBATCH"
  providers {
    AWSBATCH {
      actor-factory = "cromwell.backend.impl.aws.AwsBatchBackendLifecycleActorFactory"
      config {
        // Base bucket for workflow executions
        root = "s3://bucketname/cromwell-tests"
        // A reference to an auth defined in the `aws` stanza at the top. This auth is used to create
        // Jobs and manipulate auth JSONs.
        auth = "default"
        // diff 2:
        numSubmitAttempts = 1
        // diff 3:
        numCreateDefinitionAttempts = 1
        default-runtime-attributes {
          queueArn: "arn:aws:batch:us-west-2:xxx:job-queue/GenomicsHighPriorityQue-xxx"
        }
        filesystems {
          s3 {
            // A reference to a potentially different auth for manipulating files via engine functions.
            auth = "default"
          }
        }
      }
    }
  }
}
Top GitHub Comments
When are you going to fix it? It is quite hard to work on AWS without call caching working properly…
ETA: It seems that the FIRST task did successfully detect that it had been run before and managed to reuse its output, BUT every one of the subsequent tasks failed to recognize that the output of the same task in the previous, identical workflow already existed. So, baby steps?