Append output variables from functions to input dataset
Not every function will fit the fn(ds: Dataset, ...) -> Dataset signature, but for the large majority that do, we have so far been adopting a convention of returning only the newly created variables. Another option would be to always write those variables into the input dataset. For lack of a better phrase I’ll call the former “Ex situ” updates and the latter “In situ” updates. Here are some pros/cons of each (a short sketch contrasting the two conventions follows the lists below):
Ex situ updates
- Advantages
  - It makes it easier to transform/interrogate results before deciding to merge them
  - Merging datasets in Xarray is trivial when there is no index, but if result variables are indexed and the provided dataset is not, there is an argument to be made for leaving what to do up to the user
  - It’s a more functional style, and it’s less code to write/maintain
- Disadvantages
  - ds.merge(fn(ds)) is clunky if you wanted the variables merged in the first place
  - It’s not as nice in pipelines, e.g. you have to do this instead: ds = xr.merge([ fn1(ds), fn2(ds) ])
In situ updates
- Advantages
  - Pipelines are nicer, e.g. ds.pipe(count_alleles).pipe(allele_frequency)
- Disadvantages
  - It’s harder to transform/interrogate results before deciding to merge them (the converse of the first ex situ advantage)
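To make the contrast concrete, here is a minimal sketch of the two conventions. The variant_count result, the dimension names, and the computation itself are illustrative assumptions, not the actual sgkit API:

```python
import xarray as xr

def variant_count_ex_situ(ds: xr.Dataset) -> xr.Dataset:
    # "Ex situ": return only the newly created variable(s)
    count = (ds["call_genotype"] > 0).sum(dim=("samples", "ploidy"))
    return xr.Dataset({"variant_count": count})

def variant_count_in_situ(ds: xr.Dataset) -> xr.Dataset:
    # "In situ": write the new variable(s) into (a copy of) the input dataset
    count = (ds["call_genotype"] > 0).sum(dim=("samples", "ploidy"))
    return ds.assign(variant_count=count)

# Ex situ needs an explicit merge to get a combined dataset:
#   ds = ds.merge(variant_count_ex_situ(ds))
# In situ chains naturally in pipelines:
#   ds = ds.pipe(variant_count_in_situ)
```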
My more opinionated view of this is that the ex situ updates are best in larger, more sophisticated workflows. I say that because I think those workflows will be very much dominated by Xarray code with calls to individual sgkit functions interspersed within it, so the need for chaining sgkit operations is not very high and the need to transform/interrogate results before merging them in is higher.
It would be great to find some way to get all of those advantages above though. Perhaps there is a way to do something like this more elegantly:
# Assuming `fn` makes inplace updates
ds_only_new_vars = fn(ds)[fn.output_variables]
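One sketch of how that could be done a little more elegantly, assuming a hypothetical declares_outputs decorator (output_variables is not an existing sgkit attribute):

```python
import xarray as xr

def declares_outputs(*names):
    # Hypothetical decorator: record the variables a function is expected to produce
    def wrap(fn):
        fn.output_variables = list(names)
        return fn
    return wrap

@declares_outputs("variant_count")
def variant_count(ds: xr.Dataset) -> xr.Dataset:
    # In situ style: writes its result into (a copy of) the input dataset
    return ds.assign(variant_count=(ds["call_genotype"] > 0).sum(dim=("samples", "ploidy")))

# Recover only the new variables from an in situ function:
#   ds_only_new_vars = variant_count(ds)[variant_count.output_variables]
```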
A potential issue with this is that functions may rely on variables from other functions (see https://github.com/pystatgen/sgkit/pull/102#issue-465519140) and add them when necessary, which would mean the set of output variables isn’t always the same. I don’t think this is actually an issue with Dask as long as defining the variables is fast. It would be fast for most functions, and Dask will remove any redundancies, so presumably we wouldn’t have to necessarily return or add them (i.e. this affects both the in situ and ex situ strategies). Using numpy variables definitely wouldn’t work that way, but I’m not sure that use case is practical outside of testing anyhow.
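To illustrate the dependency point, here is a hedged sketch (the variable names and the biallelic count_alleles dependency are assumptions) of a function that defines a missing intermediate itself; with Dask-backed arrays the extra definition stays lazy, so nothing redundant is computed:

```python
import xarray as xr

def count_alleles(ds: xr.Dataset) -> xr.Dataset:
    # Assumed dependency: per-variant count of each allele (biallelic only here)
    gt = ds["call_genotype"]
    ac = xr.concat(
        [(gt == a).sum(dim=("samples", "ploidy")) for a in (0, 1)], dim="alleles"
    )
    return xr.Dataset({"variant_allele_count": ac})

def allele_frequency(ds: xr.Dataset) -> xr.Dataset:
    # Define the dependency only if the caller has not already provided it
    if "variant_allele_count" not in ds:
        ds = ds.merge(count_alleles(ds))
    ac = ds["variant_allele_count"]
    return xr.Dataset({"variant_allele_frequency": ac / ac.sum(dim="alleles")})
```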
Some of @ravwojdyla’s work to “schemify” function results would make it easy to know what variables a function produces (https://github.com/pystatgen/sgkit/issues/43#issuecomment-669478348). There could be some overlap here with how that works out too.
Top GitHub Comments
Exactly.
Hail does it too, and I agree it’s one less thing for new users to learn. In the interest of moving this forward, it sounds like we’re all in agreement on https://github.com/pystatgen/sgkit/issues/103#issuecomment-672091886 (the numbered steps a few posts above), with the only difference being we’ll call the boolean flag merge.
Related to this, what happens if a value in the input dataset is re-calculated? Will this be common? For example, there are multiple methods for calculating expected heterozygosity; will every method produce a unique variable name, or will they all result in the same variable name to ease downstream use?
If a user re-calculates 'heterozygosity_expected' (with merge=True) on a dataset that already contains a value of that name, then they should probably see a warning to that effect. The warnings can then be promoted to errors in ‘production’ pipelines to ensure that there is no variable clobbering.
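As a hedged sketch of what that could look like in a shared helper (the helper name and behaviour are assumptions, not existing sgkit code):

```python
import warnings
import xarray as xr

def merge_results(ds: xr.Dataset, new: xr.Dataset, merge: bool = True) -> xr.Dataset:
    # Hypothetical helper: warn before new results clobber existing variables
    if not merge:
        return new
    clobbered = sorted(set(new.data_vars) & set(ds.data_vars))
    if clobbered:
        warnings.warn(f"Merge is overwriting existing variables: {clobbered}")
    # Variables in `new` take precedence over the ones already in `ds`
    return xr.merge([new, ds], compat="override")
```

Promoting the warning to an error in a ‘production’ pipeline could then be as simple as warnings.simplefilter("error").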
A simple-minded approach to this would be to include the arguments map_inputs and/or map_outputs, which each take a dict mapping the ‘default’ variable name to custom names.
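For example, a hedged sketch of how map_outputs might be applied inside a method (the input variable name and the placeholder computation are assumptions, not the real estimator):

```python
import xarray as xr

def heterozygosity_expected(ds: xr.Dataset, merge: bool = True, map_outputs: dict = None) -> xr.Dataset:
    # Placeholder estimator: He = 1 - sum(p_i^2), from assumed per-variant allele frequencies
    freq = ds["variant_allele_frequency"]
    he = 1 - (freq ** 2).sum(dim="alleles")
    new = xr.Dataset({"heterozygosity_expected": he})
    if map_outputs:
        # e.g. map_outputs={"heterozygosity_expected": "heterozygosity_expected_custom"}
        new = new.rename(map_outputs)
    return ds.merge(new) if merge else new
```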