Append output variables from functions to input dataset

See original GitHub issue

Not every function will fit the fn(ds: Dataset, ...) -> Dataset signature, but for the large majority that do we have so far been adopting a convention that only returns newly created variables. Another option would be to always try to write those variables into the input dataset. For lack of a better phrase I’ll call the former “Ex situ” updates and the latter “In situ” updates. Here are some pros/cons of each:

Ex situ updates

  • Advantages
    • It makes it easier to transform/interrogate results before deciding to merge them
    • Merging datasets in Xarray is trivial when there is no index, but if result variables are indexed and the provided dataset is not, there is an argument for leaving what to do up to the user
    • It’s a more functional style, and there is less code to write/maintain
  • Disadvantages
    • ds.merge(fn(ds)) is clunky if you wanted the variables merged in the first place
    • It’s not as nice in pipelines, e.g. you have to do this instead: ds = xr.merge([ fn1(ds), fn2(ds) ])
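
As a concrete sketch of the ex situ convention (the function body and variable names here are illustrative, not the actual sgkit API):

```python
import numpy as np
import xarray as xr

def allele_frequency(ds: xr.Dataset) -> xr.Dataset:
    """Ex situ style: return a Dataset containing only the newly created variables."""
    counts = ds["call_allele_count"]
    freq = counts / counts.sum(dim="alleles")
    return xr.Dataset({"allele_frequency": freq})

ds = xr.Dataset(
    {"call_allele_count": (("variants", "alleles"), np.array([[3, 1], [2, 2]]))}
)

# The caller decides whether and how to merge the result back in:
merged = ds.merge(allele_frequency(ds))
```

The function itself never touches the input, so the user is free to inspect or transform the result before merging.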

In situ updates

  • Advantages
    • Pipelines are nicer, e.g. ds.pipe(count_alleles).pipe(allele_frequency)
  • Disadvantages
    • We’re likely to have some functions that produce 10s (or possibly 100s) of variables, and users will only care about a few of them, so fully merging all of them into a working dataset will almost never be the intended use (e.g. nirvana/vep)

My more opinionated view of this is that the ex situ updates are best in larger, more sophisticated workflows. I say that because I think those workflows will be very much dominated by Xarray code with calls to individual sgkit functions interspersed within it, so the need for chaining sgkit operations is not very high and the need to transform/interrogate results before merging them in is higher.

It would be great to find some way to get all of those advantages above though. Perhaps there is a way to do something like this more elegantly:

# Assuming `fn` makes inplace updates
ds_only_new_vars = fn(ds)[fn.output_variables]
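
For instance, if each function advertised the names it creates (a hypothetical output_variables attribute; the function and variable names are made up for illustration), recovering only the new variables from an in situ update could look like:

```python
import numpy as np
import xarray as xr

def count_alleles(ds: xr.Dataset) -> xr.Dataset:
    """In situ style: return the input dataset plus the newly created variables."""
    out = ds.copy()
    out["call_allele_count"] = ds["call_genotype"].sum(dim="ploidy")  # placeholder computation
    return out

# Hypothetical convention: each function advertises the variables it creates.
count_alleles.output_variables = ["call_allele_count"]

ds = xr.Dataset(
    {"call_genotype": (("variants", "samples", "ploidy"), np.zeros((2, 3, 2), dtype="i1"))}
)

# Select only the newly created variables out of the accumulated result:
ds_only_new_vars = count_alleles(ds)[count_alleles.output_variables]
```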

A potential issue with this is that functions may rely on variables produced by other functions (see https://github.com/pystatgen/sgkit/pull/102#issue-465519140) and add them when necessary, which would mean the set of output variables isn’t always the same. I don’t think this is actually an issue with Dask as long as defining the variables is fast. It would be for most functions, and Dask will remove any redundancies, so presumably we wouldn’t necessarily have to return or add them (i.e. this affects both the in situ and ex situ strategies). Using numpy variables definitely wouldn’t work that way, but I’m not sure that use case is practical outside of testing anyhow.

Some of @ravwojdyla’s work to “schemify” function results would make it easy to know what variables a function produces (https://github.com/pystatgen/sgkit/issues/43#issuecomment-669478348). There could be some overlap here with how that works out too.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 19 (1 by maintainers)

Top GitHub Comments

2 reactions
eric-czech commented, Aug 12, 2020

Do I have this right?

Exactly.

for new users it’s easy to explain that everything is accumulated in the newly returned dataset. This is the model that I have in my mind, possibly influenced by Scanpy

Hail does it too and I agree it’s one less thing for new users to learn. In the interest of moving this forward, it sounds like we’re all in agreement on https://github.com/pystatgen/sgkit/issues/103#issuecomment-672091886 (the numbered steps a few posts above) with the only difference being we’ll call the boolean flag merge.
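
A minimal sketch of that convention, assuming a hypothetical allele_frequency function with a merge flag (illustrative names, not the actual sgkit implementation):

```python
import numpy as np
import xarray as xr

def allele_frequency(ds: xr.Dataset, merge: bool = True) -> xr.Dataset:
    """Compute new variables; either merge them into (a copy of) the input
    dataset or return them on their own, depending on the merge flag."""
    counts = ds["call_allele_count"]
    new = xr.Dataset({"allele_frequency": counts / counts.sum(dim="alleles")})
    return ds.merge(new) if merge else new

ds = xr.Dataset(
    {"call_allele_count": (("variants", "alleles"), np.array([[3, 1]]))}
)

full = allele_frequency(ds)                   # accumulated dataset, convenient for .pipe chains
only_new = allele_frequency(ds, merge=False)  # ex situ result for manual handling
```

With merge=True as the default, new users get the "everything accumulates" model, while merge=False preserves the ex situ workflow.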

1 reaction
timothymillar commented, Aug 13, 2020

Related to this, what happens if a value in the input dataset is re-calculated? Will this be common? For example, there are multiple methods for calculating expected heterozygosity; will every method produce a unique variable name, or will they all result in the same variable name to ease downstream use?

If a user re-calculates 'heterozygosity_expected' (with merge=True) on a dataset that already contains a value of that name then they should probably see a warning something like:

MergeWarning: The following values in the input dataset will be replaced in the output: 'heterozygosity_expected'

The warnings can then be promoted to errors in ‘production’ pipelines to ensure that there is no variable clobbering.
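
That promotion can be done with the standard warnings machinery. A minimal sketch, using plain dicts and a hypothetical MergeWarning class (not an existing sgkit API):

```python
import warnings

class MergeWarning(UserWarning):
    """Hypothetical warning emitted when a merge would replace existing variables."""

def merge_with_warning(existing: dict, new: dict) -> dict:
    """Merge two variable mappings, warning about any names that get clobbered."""
    clobbered = sorted(set(existing) & set(new))
    if clobbered:
        warnings.warn(
            "The following values in the input dataset will be replaced "
            f"in the output: {clobbered}",
            MergeWarning,
        )
    return {**existing, **new}

# In a 'production' pipeline, promote the warning to an error:
warnings.simplefilter("error", MergeWarning)
try:
    merge_with_warning({"heterozygosity_expected": 1}, {"heterozygosity_expected": 2})
    clobber_detected = False
except MergeWarning:
    clobber_detected = True
```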

One other thing I’d like to bring up here is that ideas were floated in the past to add arguments to control the names of the variables that get created.

A simple-minded approach to this would be to include the arguments map_inputs and/or map_outputs, each of which takes a dict mapping the ‘default’ variable names to custom names.
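
A sketch of how map_outputs might behave, using hypothetical function and variable names:

```python
import numpy as np
import xarray as xr

def heterozygosity_expected(ds: xr.Dataset, map_outputs=None) -> xr.Dataset:
    """Hypothetical map_outputs argument: rename default output variable names."""
    p = ds["allele_frequency"]
    out = xr.Dataset({"heterozygosity_expected": 2 * p * (1 - p)})
    return out.rename(map_outputs) if map_outputs else out

ds = xr.Dataset({"allele_frequency": (("variants",), np.array([0.5]))})

# Each heterozygosity method could then write to its own distinct name:
renamed = heterozygosity_expected(
    ds, map_outputs={"heterozygosity_expected": "He_method1"}
)
```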
