File Bundles and Secondary / Accessory Files
Dear WDL team!
I was hoping to get the community's opinion on file bundles, or what CWL calls secondary files. This issue grew out of https://github.com/broadinstitute/cromwell/issues/2269 and a discussion on the WDL forum, where there was general support for accessory files.
My main input is that I’d like the associated files to be attached within an explicit type, and not annotated on a specific task. Even though it might be easy to determine at runtime whether the list of index files is different, I think it would be clearer to have a bundle type (which could follow the same import rules as structs).
CWL annotates each input with a list of secondaryFiles that uses a simple query syntax:
- If string begins with one or more caret ^ characters, for each caret, remove the last file extension from the path (the last period . and all following characters). If there are no file extensions, the path is unchanged.
- Append the remainder of the string to the end of the file path.
(They also allow for an expression that should resolve to a filename or an array of files)
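To make the two rules above concrete, here is a minimal Python sketch of how such a pattern could be applied to a primary file path. The function name `resolve_secondary` is hypothetical, and limiting the extension stripping to the file name (not directory names) is my own assumption rather than something the rules spell out.

```python
def resolve_secondary(primary: str, pattern: str) -> str:
    """Apply a CWL-style secondaryFiles pattern to a primary file path.

    Hypothetical helper for illustration; not part of CWL or WDL tooling.
    """
    path = primary
    # Rule 1: each leading caret removes the last file extension.
    while pattern.startswith("^"):
        pattern = pattern[1:]
        # Assumption: only look for the extension in the file name,
        # so a period in a directory name is never touched.
        head, sep, name = path.rpartition("/")
        dot = name.rfind(".")
        if dot > 0:  # assumption: a leading dot (hidden file) is not an extension
            name = name[:dot]
        path = head + sep + name
    # Rule 2: append the remainder of the pattern.
    return path + pattern


print(resolve_secondary("sample.bam", ".bai"))    # sample.bam.bai
print(resolve_secondary("sample.bam", "^.txt"))   # sample.txt
```

With the indexed BAM example used below, ".bai" gives $base.bam.bai and "^.txt" gives $base.txt.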
My suggestions
To be taken with a grain of salt, I’ve got a few suggestions on how this might be expressed in WDL to get the ball rolling. I’ll use a modified indexed BAM that has three files ($base.bam, $base.bam.bai, $base.txt) to show how the examples hold up:
- Create a bundle that has an implicit base file type (like CWL) and references secondary files relative to this base file using a secondary-files selector syntax, probably the same as CWL's for consistency, since as far as I know it does the job. If coerced into a File, it should just resolve to the base file (in the following case, a BAM).
bundle NamedModifiedIndexBam {
    bai = ".bai"
    txt = "^.txt"
}
OR (just two suggestions for the same thing, not that WDL should accept both)
bundle AnonymousModifiedIndexBam = [".bai", "^.txt"]
- Create a more explicit bundle that has a basename and must explicitly reference each associated file. It might be a good idea to have a base property, which would be the resolver when passing to the command line or when coercing into a File. The benefits are that it doesn’t require the query language and is clearer to read.
bundle ExplicitModifiedIndexBam {
    bam = ".bam" # or base = ".bam"
    bai = ".bam.bai"
    txt = ".txt"
}
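As a rough illustration of how an engine might treat the explicit variant, here is a small Python sketch. The class `ExplicitBundle`, its fields, and its methods are all hypothetical names made up for this example; the only thing taken from the proposal is the idea of a basename plus one suffix per member, with one member acting as the base for File coercion.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ExplicitBundle:
    """Hypothetical model of the explicit bundle idea (not real WDL)."""
    suffixes: Dict[str, str]  # member name -> suffix appended to the basename
    base_member: str          # member used when the bundle is coerced to a File

    def resolve(self, basename: str) -> Dict[str, str]:
        # Every member path is simply basename + suffix; no query syntax needed.
        return {name: basename + suffix for name, suffix in self.suffixes.items()}

    def as_file(self, basename: str) -> str:
        # Coercion to a single File picks the designated base member.
        return basename + self.suffixes[self.base_member]


explicit_bam = ExplicitBundle(
    suffixes={"bam": ".bam", "bai": ".bam.bai", "txt": ".txt"},
    base_member="bam",
)
print(explicit_bam.resolve("/data/sample")["bai"])  # /data/sample.bam.bai
print(explicit_bam.as_file("/data/sample"))         # /data/sample.bam
```

The trade-off compared with the caret syntax is that every suffix is spelled out in full, which is more verbose but needs no extra rules to read.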
Then you could reference them in the same way you do primitives or structs:
task my_task {
    AnonymousModifiedIndexBam bamFile
    NamedModifiedIndexBam namedBamFile
    ExplicitModifiedIndexBam explicitBamFile
    command {
        echo ${bamFile} # :: $base.bam
        echo ${namedBamFile.base} # :: $base.bam
        echo ${explicitBamFile.bai} # :: $base.bam.bai
    }
    # example of outputs
    output {
        AnonymousModifiedIndexBam bamOut = glob("output.bam")
        NamedModifiedIndexBam namedOut = glob("output.bam")
        ExplicitModifiedIndexBam explicitOut = glob("output")
    }
}
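The proposal does not say how an output such as glob("output") would be turned into a bundle. One possible reading, sketched in Python below, is that the engine looks for basename + suffix for every member and fails if any member is missing. The function `collect_bundle` and its parameters are hypothetical and only meant to show the idea.

```python
from pathlib import Path
from typing import Dict


def collect_bundle(directory: str, basename: str, suffixes: Dict[str, str]) -> Dict[str, Path]:
    """Return member name -> path for a bundle, or raise if a member is missing.

    Hypothetical helper; assumes all bundle members sit in the same directory.
    """
    found = {}
    missing = []
    for member, suffix in suffixes.items():
        candidate = Path(directory) / (basename + suffix)
        if candidate.is_file():
            found[member] = candidate
        else:
            missing.append(candidate.name)
    if missing:
        raise FileNotFoundError(
            f"bundle '{basename}' is missing: {', '.join(missing)}"
        )
    return found
```

For the explicit bundle above, this would be called with the basename "output" and the suffixes ".bam", ".bam.bai" and ".txt".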
Top GitHub Comments
I can definitely see some cases where something like this would be useful, indexes being one obvious example. I’m wondering whether it would be an option to simply allow defining defaults in a struct, rather than introducing an entirely new type? I’m guessing there was some reason this isn’t allowed though, considering it is expressly not permitted.
Or, just to throw out another idea, adding an Implicit (or Inferred) block.
Either way, I feel that the second option you provide here would be nicer. Giving ^ some special meaning seems like it might get rather confusing, especially considering it already has a special meaning in regex.
My first thought was to also latch on to the Struct concept somehow, although at the moment it doesn’t map cleanly, as @DavyCats points out.
We could potentially describe a mechanism for (de)localizing entire Struct objects (I don’t think the spec currently allows for that) and then, as @DavyCats describes, some scheme for pattern matching.