API for reshaping DataArrays as 2D "data matrices" for use in machine learning

See original GitHub issue

Machine learning and linear algebra problems are often expressed in terms of operations on matrices rather than arrays of arbitrary dimension, and there is currently no convenient way to turn DataArrays (or combinations of DataArrays) into a single “data matrix”.

As an example, I have needed to use scikit-learn lately with data from DataArray objects. Scikit-learn requires the data to be expressed in terms of simple 2-dimensional matrices. The rows are called samples, and the columns are known as features. It is annoying and error-prone to transpose and reshape a data array by hand to fit into this format. For instance, this GitHub repo for xarray-aware sklearn-like objects devotes many lines of code to massaging data arrays into data matrices. I think that this reshaping workflow might be common enough to warrant some kind of treatment in xarray.
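
To make the by-hand workflow concrete, here is a minimal sketch using plain NumPy. The array sizes and variable names are assumptions chosen for illustration; the point is that the axis order and shapes have to be tracked manually in both directions.

```python
import numpy as np

# Toy array with dimensions (lat, lon, time); the sizes are arbitrary.
nlat, nlon, ntime = 3, 4, 5
A = np.arange(nlat * nlon * ntime, dtype=float).reshape(nlat, nlon, ntime)

# Move the sample dimension (time) to the front, then flatten the
# feature dimensions (lat, lon) into a single axis.
data_matrix = A.transpose(2, 0, 1).reshape(ntime, nlat * nlon)

# Undo the reshape: the dimension order and sizes must be remembered by hand.
restored = data_matrix.reshape(ntime, nlat, nlon).transpose(1, 2, 0)
```

Getting either the transpose order or the reshape sizes wrong silently scrambles the data, which is the kind of mistake a dedicated API would prevent.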

I have written some code in this gist that I have found pretty convenient for doing this. The gist has an XRReshaper class which can be used for reshaping data to and from a matrix format. The basic usage for an EOF analysis of a dataset A(lat, lon, time) looks like this:

feature_dims = ['lat', 'lon']

rs = XRReshaper(A)
data_matrix, _ = rs.to(feature_dims)

# Some linear algebra or machine learning
_, _, eofs = svd(data_matrix)

eofs_datarray = rs.get(eofs[0], ['mode'] + feature_dims)
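
For comparison, a similar round trip can be sketched with xarray's built-in `stack` method and NumPy's SVD. This is not the gist's XRReshaper class; the toy data, coordinate values, and variable names are assumptions, and the final reshape relies on `stack` flattening the feature dimensions in the order they are listed.

```python
import numpy as np
import xarray as xr

# Toy dataset A(lat, lon, time) with random values (illustrative only).
rng = np.random.default_rng(0)
A = xr.DataArray(
    rng.standard_normal((3, 4, 6)),
    dims=("lat", "lon", "time"),
    coords={
        "lat": [-10.0, 0.0, 10.0],
        "lon": [0.0, 90.0, 180.0, 270.0],
        "time": np.arange(6),
    },
)

feature_dims = ["lat", "lon"]

# Collapse the feature dimensions into one stacked dimension and put
# the sample dimension (time) first.
stacked = A.stack(features=feature_dims).transpose("time", "features")
data_matrix = stacked.values  # shape (time, lat * lon)

# Rows of vt are the EOF patterns in the flattened feature space.
_, _, vt = np.linalg.svd(data_matrix, full_matrices=False)

# Put the patterns back onto (mode, lat, lon), reusing A's coordinates.
eofs = xr.DataArray(
    vt.reshape(-1, A.sizes["lat"], A.sizes["lon"]),
    dims=("mode", "lat", "lon"),
    coords={"lat": A["lat"], "lon": A["lon"]},
)
```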

I am not sure this is the best API, but it seems to work pretty well and I have used it here to implement some xarray-aware sklearn-like objects for PCA, which can be used like

feature_dims = ['lat', 'lon']
pca = XPCA(feature_dims, n_components=10, weight=cos(A.lat))
pca.fit(A)
pca.transform(A)
eofs = pca.components_

Another syntax which might be helpful is some kind of context manager approach like

with XRReshaper(A) as (rs, data_matrix):
     # do some stuff with data_matrix
# use rs to restore output to a data array.
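
One way such a context manager could work is sketched below with plain NumPy. The class name `ReshapeContext`, its constructor arguments, and the `restore` method are all hypothetical and are not taken from the gist; the sketch only shows that `__enter__` can hand back both the helper and the data matrix.

```python
import numpy as np


class ReshapeContext:
    """Hypothetical helper: flattens the feature axes on entry."""

    def __init__(self, array, dims, feature_dims):
        self.array = np.asarray(array)
        self.dims = tuple(dims)
        self.feature_dims = tuple(feature_dims)
        # Every dimension that is not a feature is treated as a sample dimension.
        self.sample_dims = tuple(d for d in self.dims if d not in self.feature_dims)

    def __enter__(self):
        # Sample axes first, feature axes last, then flatten to 2D.
        order = [self.dims.index(d) for d in self.sample_dims + self.feature_dims]
        sample_sizes = [self.array.shape[self.dims.index(d)] for d in self.sample_dims]
        n_samples = int(np.prod(sample_sizes))
        data_matrix = self.array.transpose(order).reshape(n_samples, -1)
        return self, data_matrix

    def __exit__(self, exc_type, exc, tb):
        return False  # do not suppress exceptions

    def restore(self, rows):
        """Reshape the last axis (length n_features) back onto the feature dims."""
        rows = np.asarray(rows)
        shape = tuple(self.array.shape[self.dims.index(d)] for d in self.feature_dims)
        return rows.reshape(rows.shape[:-1] + shape)


A = np.random.default_rng(0).standard_normal((3, 4, 5))  # (lat, lon, time)

with ReshapeContext(A, ("lat", "lon", "time"), ("lat", "lon")) as (rs, data_matrix):
    _, _, vt = np.linalg.svd(data_matrix, full_matrices=False)

eofs = rs.restore(vt)  # axes: (mode, lat, lon)
```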

Issue Analytics

  • State:closed
  • Created 6 years ago
  • Comments:9 (9 by maintainers)

Top GitHub Comments

nbren12 commented, Mar 23, 2017

Cool! Thanks for that link. As far as the API is concerned, I think I like the ReshapeCoder approach a little better because it does not require keeping track of a feature_dims vector list throughout the code, like my class does. It also could generalize beyond just creating a 2D array.

To produce a dataset B(samples,features) from a dataset A(x,y,z,t) how do you feel about a syntax like this:

rs = Reshaper(dict(samples=('t',), features=('x', 'y', 'z')), coords=A.coords)

B = rs.encode(A)


_, _, eofs = svd(B.data)

# eofs is now a 2D dask array so we need to give 
# it dimension information
eof_dims = ['mode', 'features']
rs.decode(eofs, eof_dims)

# to decode XArray object we don't need to pass dimension info 
rs.decode(B)

On the other hand, it would be nice to be able to reshape data through a syntax like

A.reshape.encode(dict(...))
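
A rough sketch of how the proposed encode/decode pair might behave is given below, using plain NumPy arrays with explicit dimension names in place of xarray objects. `SimpleReshaper`, its `sizes` argument, and the returned `(array, dims)` pair are assumptions made for illustration; none of this is an existing xarray API.

```python
import numpy as np


class SimpleReshaper:
    """Hypothetical encode/decode helper; not part of xarray."""

    def __init__(self, groups, sizes):
        # groups: new dimension name -> tuple of original dimension names
        # sizes: original dimension name -> length
        self.groups = {name: tuple(dims) for name, dims in groups.items()}
        self.sizes = dict(sizes)

    def encode(self, array, dims):
        """Transpose and flatten `array` so it has one axis per group."""
        order = [dims.index(d) for group in self.groups.values() for d in group]
        shape = [
            int(np.prod([self.sizes[d] for d in group]))
            for group in self.groups.values()
        ]
        return np.asarray(array).transpose(order).reshape(shape)

    def decode(self, array, dims):
        """Expand every grouped axis named in `dims` back to its original dims."""
        out_dims, out_shape = [], []
        for name, length in zip(dims, np.shape(array)):
            if name in self.groups:
                out_dims.extend(self.groups[name])
                out_shape.extend(self.sizes[d] for d in self.groups[name])
            else:
                # Unknown names (such as "mode") are passed through unchanged.
                out_dims.append(name)
                out_shape.append(length)
        return np.asarray(array).reshape(out_shape), tuple(out_dims)


dims = ("x", "y", "z", "t")
A = np.random.default_rng(0).standard_normal((2, 3, 4, 5))

rs = SimpleReshaper(
    dict(samples=("t",), features=("x", "y", "z")),
    sizes=dict(zip(dims, A.shape)),
)
B = rs.encode(A, dims)  # axes: (samples, features)

_, _, eofs = np.linalg.svd(B, full_matrices=False)
eof_array, eof_dims = rs.decode(eofs, ["mode", "features"])
```

Because the grouping is stored on the reshaper, the caller only names the axes of the array being decoded and does not have to carry a feature_dims list around.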

nbren12 commented, Oct 19, 2017

Sorry. I guess I should have made my last comment in the PR.
