API for reshaping DataArrays as 2D "data matrices" for use in machine learning
Machine learning and linear algebra problems are often expressed in terms of operations on matrices rather than arrays of arbitrary dimension, and there is currently no convenient way to turn DataArrays (or combinations of DataArrays) into a single “data matrix”.
As an example, I have needed to use scikit-learn lately with data from DataArray objects. Scikit-learn requires the data to be expressed as simple 2-dimensional matrices, where the rows are called samples and the columns are called features. It is annoying and error-prone to transpose and reshape a data array by hand to fit this format. For instance, this GitHub repo for xarray-aware sklearn-like objects devotes many lines of code to massaging data arrays into data matrices. I think this reshaping workflow might be common enough to warrant some kind of treatment in xarray.
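To make the "by hand" step concrete, here is a minimal NumPy-only sketch of the transpose-and-reshape bookkeeping described above. The array sizes and the names `A`, `data_matrix`, and `restored` are made up for illustration; time is assumed to be the sample dimension, with lat and lon as the features.

```python
import numpy as np

# Hypothetical toy array with dimensions (lat, lon, time).
n_lat, n_lon, n_time = 3, 4, 5
A = np.arange(n_lat * n_lon * n_time, dtype=float).reshape(n_lat, n_lon, n_time)

# Move the sample dimension (time) to the front, then flatten the
# feature dimensions (lat, lon) into a single axis: rows are samples,
# columns are features.
data_matrix = np.moveaxis(A, -1, 0).reshape(n_time, n_lat * n_lon)

# Undo it: restore (time, lat, lon), then move time back to the end.
restored = np.moveaxis(data_matrix.reshape(n_time, n_lat, n_lon), 0, -1)
```

Every step depends on remembering the axis order and sizes, which is exactly the information a DataArray already carries in its dimension names.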
I have written some code in this gist that I have found pretty convenient for doing this. The gist has an XRReshaper class which can be used for reshaping data to and from a matrix format. The basic usage for an EOF analysis of a dataset A(lat, lon, time) looks like this:
feature_dims = ['lat', 'lon']
rs = XRReshaper(A)
data_matrix, _ = rs.to(feature_dims)
# Some linear algebra or machine learning
_,_, eofs = svd(data_matrix)
eofs_datarray = rs.get(eofs[0], ['mode'] + feature_dims)
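For comparison, a similar round trip can be sketched with xarray's existing `stack` and `transpose` methods, without the gist's class. This is only an illustration under assumptions: the toy data, coordinate values, and variable names are invented, and the EOFs are reshaped back using the dimension sizes of `A` rather than anything XRReshaper does.

```python
import numpy as np
import xarray as xr

# Hypothetical toy dataset A(lat, lon, time); values and coordinates are made up.
rng = np.random.default_rng(0)
A = xr.DataArray(
    rng.standard_normal((4, 5, 6)),
    dims=("lat", "lon", "time"),
    coords={
        "lat": np.linspace(-60.0, 60.0, 4),
        "lon": np.linspace(0.0, 288.0, 5),
        "time": np.arange(6),
    },
)

feature_dims = ["lat", "lon"]

# Collapse the feature dimensions into one "features" dimension and put
# the remaining sample dimension (time) first: rows = samples, cols = features.
stacked = A.stack(features=feature_dims).transpose("time", "features")
data_matrix = stacked.values

# Some linear algebra: rows of vt are the EOFs in feature space.
_, _, vt = np.linalg.svd(data_matrix, full_matrices=False)

# Restore the feature axis to (lat, lon). stack() flattens the dimensions in
# the order given, so a plain reshape with the original sizes undoes it.
eofs = xr.DataArray(
    vt.reshape(-1, A.sizes["lat"], A.sizes["lon"]),
    dims=["mode"] + feature_dims,
    coords={d: A.coords[d] for d in feature_dims},
)
```

The part that still has to be tracked by hand is `feature_dims` and the sizes used to rebuild the output, which is the bookkeeping a reshaper object is meant to hide.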
I am not sure this is the best API, but it seems to work pretty well, and I have used it here to implement some xarray-aware sklearn-like objects for PCA, which can be used like this:
feature_dims = ['lat', 'lon']
pca = XPCA(feature_dims, n_components=10, weight=cos(A.lat))
pca.fit(A)
pca.transform(A)
eofs = pca.components_
Another syntax that might be helpful is some kind of context-manager approach like this:
with XRReshaper(A) as (rs, data_matrix):
    # do some stuff with data_matrix
    # use rs to restore output to a data array.
    ...
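One way such a context manager could be put together is sketched below using `contextlib.contextmanager`. Everything here is hypothetical: `MatrixReshaper`, `as_data_matrix`, and `features_to_dataarray` are illustrative names, not the gist's XRReshaper API, and the feature dimensions are passed up front as an assumption about how the call would look.

```python
from contextlib import contextmanager

import numpy as np
import xarray as xr


class MatrixReshaper:
    """Hypothetical helper (not the gist's XRReshaper) that remembers
    how a DataArray was flattened into a (samples, features) matrix."""

    def __init__(self, da, feature_dims):
        self.feature_dims = list(feature_dims)
        self.sample_dims = [d for d in da.dims if d not in self.feature_dims]
        # Sample dimensions first, feature dimensions last.
        self._template = da.transpose(*self.sample_dims, *self.feature_dims)

    def to_matrix(self):
        sizes = [self._template.sizes[d] for d in self.sample_dims]
        n_samples = int(np.prod(sizes))
        return self._template.values.reshape(n_samples, -1)

    def features_to_dataarray(self, arr, leading_dims=()):
        """Wrap an array whose last axis runs over the flattened features."""
        arr = np.asarray(arr)
        feature_shape = tuple(self._template.sizes[d] for d in self.feature_dims)
        coords = {
            d: self._template.coords[d]
            for d in self.feature_dims
            if d in self._template.coords
        }
        return xr.DataArray(
            arr.reshape(arr.shape[:-1] + feature_shape),
            dims=list(leading_dims) + self.feature_dims,
            coords=coords,
        )


@contextmanager
def as_data_matrix(da, feature_dims):
    """Yield a (reshaper, data_matrix) pair for use in a with-block."""
    rs = MatrixReshaper(da, feature_dims)
    yield rs, rs.to_matrix()


# Usage with made-up data: average over samples, then restore lat/lon.
A = xr.DataArray(
    np.random.default_rng(0).standard_normal((4, 5, 6)),
    dims=("lat", "lon", "time"),
    coords={"lat": np.linspace(-60.0, 60.0, 4)},
)
with as_data_matrix(A, ["lat", "lon"]) as (rs, data_matrix):
    mean_map = rs.features_to_dataarray(data_matrix.mean(axis=0))
```

In this sketch there is nothing to clean up on exit, so the with-block mainly serves to scope the reshaper and the matrix together; a plain function returning the pair would behave the same.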
Issue Analytics
- Created 6 years ago
- Comments:9 (9 by maintainers)

Cool! Thanks for that link. As far as the API is concerned, I think I like the ReshapeCoder approach a little better because it does not require keeping track of a feature_dims vector list throughout the code, like my class does. It also could generalize beyond just creating a 2D array.
To produce a dataset B(samples, features) from a dataset A(x, y, z, t), how do you feel about a syntax like this:
On the other hand, it would be nice to be able to reshape data through a syntax like
Sorry. I guess I should have made my last comment in the PR.