How should xarray use/support sparse arrays?
See original GitHub issue.

I'm looking forward to being able to easily create sparse xarray objects from pandas: https://github.com/pydata/xarray/issues/3206
Are there other xarray APIs that could make good use of sparse arrays, or could make sparse arrays easier to use?
Some ideas:
- to_sparse()/to_dense() methods for converting to/from sparse without requiring use of .data (see the sketch after this list)
- to_dataframe()/to_series() could grow options for skipping the fill value in sparse arrays, so they can round-trip MultiIndex data back to pandas
- Serialization to/from netCDF files, using some custom convention (see https://github.com/pydata/xarray/issues/1375#issuecomment-402699810)
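For context, a minimal sketch of what this conversion looks like today via .data, assuming the pydata/sparse package is installed; the to_sparse()/to_dense() names above are proposals, not existing API:

```python
import numpy as np
import sparse
import xarray as xr

# A mostly-zero dense array wrapped by xarray.
dense = xr.DataArray(np.eye(1000), dims=("x", "y"))

# Today the round trip goes through .data explicitly; the proposed
# to_sparse()/to_dense() methods would wrap exactly this pattern.
as_sparse = dense.copy(data=sparse.COO.from_numpy(dense.data))
back_to_dense = as_sparse.copy(data=as_sparse.data.todense())
```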
Issue Analytics
- Created: 4 years ago
- Reactions: 12
- Comments: 47 (18 by maintainers)
Top Results From Across the Web

Sparse arrays and the CESM land model component
Usually we work with Xarray wrapping a Dask array which in turn uses NumPy arrays for each block; or just Xarray wrapping NumPy...

How to make use of xarray's sparse functionality when ...
Either: arr_a = arr_a.map_blocks(sparse.COO) arr_b = arr_b.map_blocks(sparse.COO). Or: xr1 = xarray.apply_ufunc(sparse.

xarray.DataArray.from_series
If the series's index is a MultiIndex, it will be expanded into a tensor product ... If sparse=True, creates a sparse array instead...

Construct Sparse Arrays - PyData/Sparse
You can construct COO arrays from coordinates and value data. ... Each row of coords contains one dimension of the desired sparse array,...

Sparse Arrays - Dask documentation
By swapping out in-memory NumPy arrays with in-memory sparse arrays, we can reuse the blocked algorithms of Dask's Array to achieve parallel and...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I would prefer to retain the dense representation, but with tricks to keep the data in memory in a sparse format.
Look at the following example with a pandas MultiIndex & sparse dtype:
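(The original comment illustrated this with a screenshot; below is a minimal reconstruction of that kind of example. The shape and fill pattern are assumptions chosen to roughly match the sizes quoted next.)

```python
import numpy as np
import pandas as pd

# ~5 million rows, almost entirely NaN.
index = pd.MultiIndex.from_product(
    [pd.date_range("2021-01-01", periods=10_000), range(500)],
    names=["time", "price"],
)
dense = pd.Series(np.nan, index=index)
dense.iloc[::100_000] = 1.0  # a handful of real observations

# The same values with a sparse dtype: only non-fill entries are stored.
sparse_series = dense.astype(pd.SparseDtype("float64", np.nan))

print(dense.memory_usage(index=False))          # ~40 MB of float64 values
print(sparse_series.memory_usage(index=False))  # a few hundred bytes
```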
The dense data uses ~40 MB of memory, while the dense representation with sparse dtypes uses only ~0.5 kB of memory!
And while you can import dataframes with the sparse=True keyword, the size seems to be reported inaccurately (both show the same size?), and we cannot inspect the data the way we can with a pandas MultiIndex + sparse dtype:
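(Again standing in for the missing screenshot, a sketch using Dataset.from_dataframe(..., sparse=True), which does exist in xarray; exactly how the size is reported may vary by version.)

```python
import numpy as np
import pandas as pd
import xarray as xr

index = pd.MultiIndex.from_product([range(1_000), range(1_000)], names=["x", "y"])
series = pd.Series(np.nan, index=index)
series.iloc[::50_000] = 1.0

# sparse=True unstacks the MultiIndex into a sparse.COO-backed variable.
ds = xr.Dataset.from_dataframe(series.to_frame("value"), sparse=True)
print(type(ds["value"].data))  # sparse.COO rather than a dense ndarray
print(ds["value"].nbytes)      # the size xarray reports; cf. the quirk above
```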
Besides, a lot of operations are not available on sparse xarray data variables (e.g. if I wanted to group by price level for ffill & downsampling):
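(The screenshot of the failing operation is not reproduced here. As a hedged illustration, one workaround is to densify the variable before running such operations; the densify helper below is hypothetical, not xarray API.)

```python
import xarray as xr

def densify(da: xr.DataArray) -> xr.DataArray:
    """Hypothetical helper: materialize a sparse-backed DataArray as dense."""
    return da.copy(data=da.data.todense())

# e.g. densify(ds["value"]).ffill("x") where the sparse-backed call fails
```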
So, it would be nice if xarray adopted pandas’ approach of unstacking sparse data.
In the end, you could extract all the non-NaN values and write them to a sparse storage format, such as TileDB sparse arrays. cc: @stavrospapadopoulos
Would it be possible for pd.{Series, DataFrame}.to_xarray() to automatically create a sparse DataArray, or to add a flag to to_xarray() that controls this? I have a very sparse dataframe, and every time I try to convert it to xarray I blow out my memory. Keeping it sparse, but logically as a DataArray, would be fantastic.
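(A sketch of the workaround available today, assuming the frame has a MultiIndex: route the conversion through Dataset.from_dataframe, which already accepts sparse=True, rather than to_xarray(). The toy frame below is a stand-in for the very sparse data described above.)

```python
import numpy as np
import pandas as pd
import xarray as xr

# Only observed (i, j) pairs are present in the frame; the full i x j
# tensor product that to_xarray() would densify has 40 x 40 cells here,
# and can easily blow up for real data.
ii = np.arange(0, 1_000, 25)  # 40 observed rows
index = pd.MultiIndex.from_arrays([ii, ii[::-1]], names=["i", "j"])
df = pd.DataFrame({"value": 1.0}, index=index)

ds = xr.Dataset.from_dataframe(df, sparse=True)  # unstacks without densifying
da = ds["value"]  # sparse-backed DataArray; missing cells are never materialized
```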