Automatic Coercions Considered Harmful
See original GitHub issueAutomatic coercions considered harmful
Authors: O. Rodeh with J. Didion (DNAnexus Inc.)
This issue describes a line of thought regarding type checking for WDL programs. It is based on experience acquired while working with DNAnexus customers on their WDL pipelines. The author also wrote a WDL type checker, evaluator, and cloud executor.
Coercions
WDL makes wide ranging use of coercions to convert one type to another. This is, in general, a well accepted idea used in a wide variety of programming languages. At DNAnexus we have run into several instances where WDL workflows errored out because static type checking was too lenient. This resulted in long running expensive customer pipelines failing. The issue is twofold: the turn around time is long, and the cost can run to thousands of dollars.
For example, under the WDL type system, a File
is auto-coercible into File?
. However, this is not the case in the DNAnexus simpler type-system. If you have a task that takes an Array[File?]
and call it with Array[File]
this will fail. For example:
version 1.0
task foo {
input {
Array[File?] bams
}
command {}
}
workflow bar {
input {
Array[File] my_files
}
call long_expensive_calculation {}
call foo { input : bams = my_files }
output {}
}
Workflow bar
will fail when it calls foo
, after it has spent a long time running an expensive calculation. This is frustrating for the users. We could prevent this, as well as many other errors, by performing stricter type checking. Below is an eclectic list of automatic coercions that appears in GATK pipelines. The GATK pipelines are large real world workflows that our customers expect to execute flawlessly on our platform. The code is a simplified version of the real program lines, stripping a way the context to focus on the main point.
Converting between T
and T?
occurs automatically
# convert an optional file-array to a file-array
Array[File] files = ["a", "b"]
Array[File]? files2 = files
# Convert an optional int to an int
Int a = 3
Int? b = a
# An array with a mixture of Int and Int?
Int c = select_first([1, b])
Converting between strings, integer, and floats “just works”.
# Convert an integer to a string
String s = 3
Int z = 1.3
# converting Array[String] to Array[Int]
Int d = "3"
Array[Int] numbers = ["1", "2"]
The branches of a conditional may not resolve to the same type. Here, one side is Array[String]
, the other is Array[String?]
.
# https://github.com/gatk-workflows/gatk4-germline-snps-indels/blob/master/JointGenotyping-terra.wdl#L556
#
Array[String] birds = ["finch", "hawk"]
Array[String?] mammals = []
Array[String?] zoo = if (true) then birds else mammals
Since T
and T?
are interchangeable, we can perform operations like T
+ T?
+ T
.
# https://github.com/gatk-workflows/broad-prod-wgs-germline-snps-indels/blob/master/JointGenotypingWf.wdl#L570
#
# Adding a String and String?
String? here = "here"
String buf = "not " + "invented " + here
# https://github.com/gatk-workflows/broad-prod-wgs-germline-snps-indels/blob/master/PairedEndSingleSampleWf.wdl#L1207
#
# convert an optional int to string.
Int? max_output = 3
String s1 = "MAX_OUTPUT=" + max_output
The following non-intuitive code type checks in WDL (womtool-51.jar) :
Array[String] strings = [1, 1.2, "boo"]
Array[Int] numbers = [1, 1.2, "3.0"]
And then aai
type checks too:
Array[Array[Int]] aai = [[1, 2], ["foo"], [1.4, 4.9, 9.3]]
Proposal
Explicit coercions are useful and make sense, for example:
File? x = y
Array[File] af = ["a.txt", "b.txt"]
However, auto-coercions can lead to unexpected behavior and costly/confusing job failures. We suggest allowing explicit coercions, adding a few necessary stdlib function for type conversion, and disallowing automatic coercions. A few necessary stdlib functions, written a la Scala, are:
T? Some(T)
Int toInt(String)
String toString(Int|Float)
File toFile(String)
In order to select the first non empty integer, as below:
Int c = select_first([1, b])
one would write:
Int c = select_first([Some(1), b])
This would make the typing of select_first
accurate; it would only take an array of T?
.
Code like:
# Convert an optional int to an int
Int a = 3
Int? b = a
would be written:
Int? b = Some(a)
A Pythonic way of writing these functions would be to have a single cast function that takes a type and a value and either casts the value to the type or throws an error. For example:
Int i = cast(Int, "hello")
String s = cast(String, 1)
File f = cast(File, "foo/bar.txt")
File err = cast(File, 1) # throws an error because Int to File casting is not allowed
cast could even handle optional types, which would get rid of the need for Some(), e.g.
String s1 = "foo/bar.txt"
String? s2 = None
File? f1 = cast(File?, s1) # f1 == Some(File("foo/bar.txt"))
File? f2 = cast(File?, s2) # f2 == None
I hope this approach would reduce the number of preventable runtime errors.
Issue Analytics
- State:
- Created 3 years ago
- Comments:6 (4 by maintainers)
Top GitHub Comments
Forcing people to be explicit about their coercions does make WDL more verbose. However, it opens the door to using type inference (@cjllanwarne mentioned this a long time ago).
The WDL type system is actually quite complex. We have overloading,
size
can be called in several different ways. We also have parametric polymorphism,cross
andzip
work on any type of array. On top of that, we have auto-coercions, which interact badly with the former two features. If we drop them, I -think-, we could do Mindley-Milner type inference which would make WDL less verbose.A realistic way of doing this is support type-inference inside a task or a workflow, supplemented with explicit coercions when needed.
How about type coercion of the inputs/outputs from the standard library functions? On the one hand - all inputs and outputs must be typed. But I do see some use cases where:
I could not find any such case in the spec, and I think this is a bit contradicting the static typing requirement. In general though seems these examples compile and run, but we start to see coercion problems when nesting such functions and applying them to optional types. Like this:
In this case we see the problems of coercion for values of
lol
which were evaluated tonull
, throwing exceptions at runtime. If we declare and type the outputs - all fine, no exceptions:We can solve this in dxCompiler and allow such scenarios, but isn’t this in the realm of automatic type coercion and contradictory to strong/static typing of the language?
Could you provide more explicit guidance on typing the inputs/outputs for the standard library functions in the language spec?