Append output variables from functions to input dataset
Not every function will fit the fn(ds: Dataset, ...) -> Dataset signature, but for the large majority that do, we have so far been adopting a convention of returning only the newly created variables. Another option would be to always write those variables into the input dataset. For lack of a better phrase I’ll call the former “Ex situ” updates and the latter “In situ” updates. Here are some pros/cons of each (a short sketch contrasting the two conventions follows the lists below):
Ex situ updates
- Advantages
  - It makes it easier to transform/interrogate results before deciding to merge them
  - Merging datasets in Xarray is trivial when there is no index, but if result variables are indexed and the provided dataset is not, there is an argument to be made for leaving what to do up to the user
  - It’s a more functional style, and it’s less code to write/maintain
- Disadvantages
  - ds.merge(fn(ds)) is clunky if you wanted the variables merged in the first place
  - It’s not as nice in pipelines, e.g. you have to do this instead: ds = xr.merge([ fn1(ds), fn2(ds) ])
In situ updates
- Advantages
  - Pipelines are nicer, e.g. ds.pipe(count_alleles).pipe(allele_frequency)
- Disadvantages
  - It’s harder to transform/interrogate results before deciding to merge them (the converse of the first ex situ advantage)
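To make the contrast concrete, here is a minimal sketch of the two conventions. The variant_count result, the dimension names, and the computation itself are illustrative assumptions, not the actual sgkit API:

```python
import xarray as xr

def variant_count_ex_situ(ds: xr.Dataset) -> xr.Dataset:
    # "Ex situ": return only the newly created variable(s)
    count = (ds["call_genotype"] > 0).sum(dim=("samples", "ploidy"))
    return xr.Dataset({"variant_count": count})

def variant_count_in_situ(ds: xr.Dataset) -> xr.Dataset:
    # "In situ": write the new variable(s) into (a copy of) the input dataset
    count = (ds["call_genotype"] > 0).sum(dim=("samples", "ploidy"))
    return ds.assign(variant_count=count)

# Ex situ needs an explicit merge to get a combined dataset:
#   ds = ds.merge(variant_count_ex_situ(ds))
# In situ chains naturally in pipelines:
#   ds = ds.pipe(variant_count_in_situ)
```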
My more opinionated view of this is that the ex situ updates are best in larger, more sophisticated workflows. I say that because I think those workflows will be very much dominated by Xarray code with calls to individual sgkit functions interspersed within it, so the need for chaining sgkit operations is not very high and the need to transform/interrogate results before merging them in is higher.
It would be great to find some way to get all of those advantages above though. Perhaps there is a way to do something like this more elegantly:
# Assuming `fn` makes inplace updates
ds_only_new_vars = fn(ds)[fn.output_variables]
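One sketch of how that could be done a little more elegantly, assuming a hypothetical declares_outputs decorator (output_variables is not an existing sgkit attribute):

```python
import xarray as xr

def declares_outputs(*names):
    # Hypothetical decorator: record the variables a function is expected to produce
    def wrap(fn):
        fn.output_variables = list(names)
        return fn
    return wrap

@declares_outputs("variant_count")
def variant_count(ds: xr.Dataset) -> xr.Dataset:
    # In situ style: writes its result into (a copy of) the input dataset
    return ds.assign(variant_count=(ds["call_genotype"] > 0).sum(dim=("samples", "ploidy")))

# Recover only the new variables from an in situ function:
#   ds_only_new_vars = variant_count(ds)[variant_count.output_variables]
```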
A potential issue with this is that functions may rely on variables from other functions (see https://github.com/pystatgen/sgkit/pull/102#issue-465519140) and add them when necessary, which would mean the set of output variables isn’t always the same. I don’t think this is actually an issue with Dask as long as defining the variables is fast. It would be fast for most functions, and Dask will remove any redundancies, so presumably we wouldn’t have to necessarily return or add them (i.e. this affects both the in situ and ex situ strategies). Using numpy variables definitely wouldn’t work that way, but I’m not sure that use case is practical outside of testing anyhow.
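To illustrate the dependency point, here is a hedged sketch (the variable names and the biallelic count_alleles dependency are assumptions) of a function that defines a missing intermediate itself; with Dask-backed arrays the extra definition stays lazy, so nothing redundant is computed:

```python
import xarray as xr

def count_alleles(ds: xr.Dataset) -> xr.Dataset:
    # Assumed dependency: per-variant count of each allele (biallelic only here)
    gt = ds["call_genotype"]
    ac = xr.concat(
        [(gt == a).sum(dim=("samples", "ploidy")) for a in (0, 1)], dim="alleles"
    )
    return xr.Dataset({"variant_allele_count": ac})

def allele_frequency(ds: xr.Dataset) -> xr.Dataset:
    # Define the dependency only if the caller has not already provided it
    if "variant_allele_count" not in ds:
        ds = ds.merge(count_alleles(ds))
    ac = ds["variant_allele_count"]
    return xr.Dataset({"variant_allele_frequency": ac / ac.sum(dim="alleles")})
```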
Some of @ravwojdyla’s work to “schemify” function results would make it easy to know what variables a function produces (https://github.com/pystatgen/sgkit/issues/43#issuecomment-669478348). There could be some overlap here with how that works out too.
Top GitHub Comments
Exactly.
Hail does it too, and I agree it’s one less thing for new users to learn. In the interest of moving this forward, it sounds like we’re all in agreement on https://github.com/pystatgen/sgkit/issues/103#issuecomment-672091886 (the numbered steps a few posts above), with the only difference being we’ll call the boolean flag merge.
Related to this, what happens if a value in the input dataset is re-calculated? Will this be common? For example, there are multiple methods for calculating expected heterozygosity; will every method produce a unique variable name, or will they all result in the same variable name to ease downstream use?
If a user re-calculates 'heterozygosity_expected' (with merge=True) on a dataset that already contains a value of that name, then they should probably see a warning to that effect. The warnings can then be promoted to errors in ‘production’ pipelines to ensure that there is no variable clobbering.
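As a hedged sketch of what that could look like in a shared helper (the helper name and behaviour are assumptions, not existing sgkit code):

```python
import warnings
import xarray as xr

def merge_results(ds: xr.Dataset, new: xr.Dataset, merge: bool = True) -> xr.Dataset:
    # Hypothetical helper: warn before new results clobber existing variables
    if not merge:
        return new
    clobbered = sorted(set(new.data_vars) & set(ds.data_vars))
    if clobbered:
        warnings.warn(f"Merge is overwriting existing variables: {clobbered}")
    # Variables in `new` take precedence over the ones already in `ds`
    return xr.merge([new, ds], compat="override")
```

Promoting the warning to an error in a ‘production’ pipeline could then be as simple as warnings.simplefilter("error").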
A simple-minded approach to this would be to include the arguments map_inputs and/or map_outputs, which each take a dict mapping the ‘default’ variable name to custom names.
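For example, a hedged sketch of how map_outputs might be applied inside a method (the input variable name and the placeholder computation are assumptions, not the real estimator):

```python
import xarray as xr

def heterozygosity_expected(ds: xr.Dataset, merge: bool = True, map_outputs: dict = None) -> xr.Dataset:
    # Placeholder estimator: He = 1 - sum(p_i^2), from assumed per-variant allele frequencies
    freq = ds["variant_allele_frequency"]
    he = 1 - (freq ** 2).sum(dim="alleles")
    new = xr.Dataset({"heterozygosity_expected": he})
    if map_outputs:
        # e.g. map_outputs={"heterozygosity_expected": "heterozygosity_expected_custom"}
        new = new.rename(map_outputs)
    return ds.merge(new) if merge else new
```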