Extending Xarray for domain-specific toolkits
Hi, I have a question about how to design an API over Xarray for a domain-specific use case (in genetics). Having seen the following:
- Extending xarray
- subclassing DataSet?
- Subclassing Dataset and DataArray (issue #706)
- Decorators for registering custom accessors in xarray (PR #806; the accessor mechanism is sketched just below)
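
For context, the accessor mechanism from that last reference is xarray's recommended extension point. A minimal sketch of how it is used (the `genetics` namespace and `describe` method are illustrative, not from the original post):

```python
import xarray as xr

# Register a "genetics" namespace on every Dataset; the accessor class
# receives the Dataset instance and can add domain-specific methods.
@xr.register_dataset_accessor("genetics")
class GeneticsAccessor:
    def __init__(self, xarray_obj: xr.Dataset):
        self._obj = xarray_obj

    def describe(self) -> None:
        # Hypothetical domain-specific method
        print(f"Dataset with variables: {list(self._obj.data_vars)}")

# Usage: any Dataset now exposes ds.genetics.describe()
```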
I wanted to reach out and seek some advice on what I’d like to do given that I don’t think any of the solutions there are what I’m looking for.
More specifically, I would like to model the datasets we work with as `xr.Dataset` subtypes, but I'd like to enforce certain preconditions for those types as well as support conversions between them. An example would be that I may have a domain-specific type `GenotypeDataset` that should always contain 3 DataArrays, and each of those arrays should meet different dtype and dimensionality constraints. That type may be converted to another type, say `HaplotypeDataset`, where the underlying data goes through some kind of transformation to produce a lower-dimensional form more amenable to a specific class of algorithms.
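
To make the precondition idea concrete, here is a minimal sketch of such a check; the variable names, dtypes, and dimension counts below are illustrative assumptions, not an established schema:

```python
import numpy as np
import xarray as xr

def validate_genotype_dataset(ds: xr.Dataset) -> xr.Dataset:
    """Enforce hypothetical GenotypeDataset preconditions."""
    # Assumed constraints: required variable names, dtype families, ndim
    expected = {
        "calls": (np.integer, 3),   # 3D integer array of allele indices
        "mask": (np.bool_, 3),      # boolean missing-data mask
        "phasing": (np.bool_, 2),   # per-call phasing flags
    }
    for name, (dtype, ndim) in expected.items():
        if name not in ds:
            raise ValueError(f"Missing required variable {name!r}")
        if not np.issubdtype(ds[name].dtype, dtype):
            raise TypeError(f"{name!r} must have dtype {dtype}, got {ds[name].dtype}")
        if ds[name].ndim != ndim:
            raise ValueError(f"{name!r} must be {ndim}-dimensional, got {ds[name].ndim}")
    return ds
```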
One API I envision around these models consists of functions that enforce nominal typing on Xarray classes, so in that case I don’t actually care if my subtypes are preserved by Xarray when operations are run. It would be nice if that subtyping wasn’t lost but I can understand that it’s a limitation for now. Here’s an example of what I mean:
```python
import xarray as xr
from genetics import api

arr1 = ???  # some 3D integer DataArray of allele indices
arr2 = ???  # a missing-data boolean DataArray
arr3 = ???  # some other domain-specific stuff like variant phasing
ds = api.GenotypeDataset(arr1, arr2, arr3)

# A function that would be in the API would look like:
def analyze_haplotype(ds: xr.Dataset) -> xr.Dataset:
    # Do stuff assuming that the user has supplied a dataset compliant with
    # the "HaplotypeDataset" constraints
    pass

analyze_haplotype(ds.to_haplotype_dataset())
```
I like the idea of trying to avoid requiring API-specific data structures for all functionality in favor of conventions over Xarray data structures. I think conveniences like these subtypes would be great for enforcing those conventions (rather than checking at the beginning of each function) as well as making it easier to go between representations, but I’m certainly open to suggestion. I think something akin to structural subtyping that extends to what arrays are contained in the Dataset, how coordinates are named, what datatypes are used, etc. would be great but I have no idea if that’s possible.
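
One way to approximate that structural checking without subclassing is to centralize the per-function checks in a decorator. A hypothetical sketch (the `expects` helper and the schema are assumptions for illustration):

```python
import functools
import xarray as xr

def expects(schema):
    """Hypothetical decorator: check the first argument against a {name: ndim} schema."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(ds: xr.Dataset, *args, **kwargs):
            for name, ndim in schema.items():
                if name not in ds or ds[name].ndim != ndim:
                    raise ValueError(
                        f"{func.__name__} expects variable {name!r} with {ndim} dims"
                    )
            return func(ds, *args, **kwargs)
        return wrapper
    return decorator

# Hypothetical usage: only accept datasets with a 2D 'haplotype' variable
@expects({"haplotype": 2})
def analyze_haplotype(ds: xr.Dataset) -> xr.Dataset:
    ...
```

This keeps functions operating on plain `xr.Dataset` objects while still failing fast on non-compliant input.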
All that said, is it still a bad idea to try to subclass Xarray data structures even if the intent was never to touch any part of the internal APIs? I noticed Xarray does some stuff like `type(array)(...)` internally, but that's the only catch I've found so far (which I worked around by dispatching to constructors based on the arguments given).
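
For illustration, the constructor-dispatch workaround mentioned above might look roughly like this; it is a sketch under the assumption that three `DataArray` positional arguments identify the domain-specific form, and the variable names are hypothetical:

```python
import xarray as xr

class GenotypeDataset(xr.Dataset):
    # xarray warns if subclasses do not declare __slots__
    __slots__ = ()

    def __init__(self, *args, **kwargs):
        if len(args) == 3 and all(isinstance(a, xr.DataArray) for a in args):
            # Domain-specific form: three constrained DataArrays
            calls, mask, phasing = args
            super().__init__({"calls": calls, "mask": mask, "phasing": phasing})
        else:
            # Plain xr.Dataset signature, so internal reconstruction via
            # type(obj)(...) still works
            super().__init__(*args, **kwargs)
```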
cc: @alimanfoo - Alistair raised some concerns to me about trying this, so he may have some thoughts here too
---

Created 3 years ago; 10 comments (2 by maintainers)
Thanks again @keewis! I moved the static typing discussion to https://github.com/pydata/xarray/issues/3967.
This is closed out now as far as I’m concerned.
Not really, I just thought the variables in the dataset were a way to uniquely identify its variant (i.e. to validate the dataset's structure). If you have different means to do so, you can of course use those instead.
Re `TypedDict`: the PEP introducing `TypedDict` explicitly mentions that it is only intended for `Dict[str, Any]` (so no subclasses of `Dict` for `TypedDict`). However, looking at the code of `TypedDict`, we should be able to do something similar for `Dataset`.

Edit: we'd still need to convince `mypy` that the custom `TypedDict` is a type…

I don't think so? There were a few discussions about subclassing, but I couldn't find anything about static type analysis. It's definitely worth having this discussion, either here (repurposing this issue) or in a new issue.
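
To make the `TypedDict` analogy concrete, this is what the standard construct looks like; a hypothetical `Dataset` equivalent would declare variable names and array types the same way, but no such construct exists in xarray or mypy today:

```python
from typing import TypedDict

# Standard TypedDict: declares the keys and value types of a plain dict
# so that static checkers like mypy can verify lookups and assignments.
class GenotypeMapping(TypedDict):
    calls: list[int]    # would be an integer DataArray in the Dataset analogy
    mask: list[bool]    # boolean missing-data mask

def process(data: GenotypeMapping) -> None:
    data["calls"]       # mypy knows this key exists and its value type
    # data["oops"]      # mypy would flag this unknown key
```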