How should xarray use/support sparse arrays?
See original GitHub issue.

I'm looking forward to being able to easily create sparse xarray objects from pandas: https://github.com/pydata/xarray/issues/3206
Are there other xarray APIs that could make good use of sparse arrays, or could make sparse arrays easier to use?
Some ideas:
- to_sparse()/to_dense() methods for converting to/from sparse without requiring use of .data (see the sketch after this list)
- to_dataframe()/to_series() could grow options for skipping the fill value in sparse arrays, so they can round-trip MultiIndex data back to pandas
- Serialization to/from netCDF files, using some custom convention (see https://github.com/pydata/xarray/issues/1375#issuecomment-402699810)
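For context, a minimal sketch of what this conversion looks like today via .data, assuming the pydata/sparse package is installed; the to_sparse()/to_dense() names above are proposals, not existing API:

```python
import numpy as np
import sparse
import xarray as xr

# A mostly-zero dense array wrapped by xarray.
dense = xr.DataArray(np.eye(1000), dims=("x", "y"))

# Today the round trip goes through .data explicitly; the proposed
# to_sparse()/to_dense() methods would wrap exactly this pattern.
as_sparse = dense.copy(data=sparse.COO.from_numpy(dense.data))
back_to_dense = as_sparse.copy(data=as_sparse.data.todense())
```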
Issue Analytics
- Created: 4 years ago
- Reactions: 12
- Comments: 47 (18 by maintainers)
Top Results From Across the Web

Sparse arrays and the CESM land model component
Usually we work with Xarray wrapping a Dask array which in turn uses NumPy arrays for each block; or just Xarray wrapping NumPy...

How to make use of xarray's sparse functionality when ...
Either: arr_a = arr_a.map_blocks(sparse.COO) arr_b = arr_b.map_blocks(sparse.COO). Or: xr1 = xarray.apply_ufunc(sparse.

xarray.DataArray.from_series
If the series's index is a MultiIndex, it will be expanded into a tensor product ... If sparse=True, creates a sparse array instead...

Construct Sparse Arrays - PyData/Sparse
You can construct COO arrays from coordinates and value data. ... Each row of coords contains one dimension of the desired sparse array,...

Sparse Arrays - Dask documentation
By swapping out in-memory NumPy arrays with in-memory sparse arrays, we can reuse the blocked algorithms of Dask's Array to achieve parallel and...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I would prefer to retain the dense representation, but with tricks to keep the data in memory in a sparse format.
Look at the following example with a pandas MultiIndex & sparse dtype:
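(The original comment illustrated this with a screenshot; below is a minimal reconstruction of that kind of example. The shape and fill pattern are assumptions chosen to roughly match the sizes quoted next.)

```python
import numpy as np
import pandas as pd

# ~5 million rows, almost entirely NaN.
index = pd.MultiIndex.from_product(
    [pd.date_range("2021-01-01", periods=10_000), range(500)],
    names=["time", "price"],
)
dense = pd.Series(np.nan, index=index)
dense.iloc[::100_000] = 1.0  # a handful of real observations

# The same values with a sparse dtype: only non-fill entries are stored.
sparse_series = dense.astype(pd.SparseDtype("float64", np.nan))

print(dense.memory_usage(index=False))          # ~40 MB of float64 values
print(sparse_series.memory_usage(index=False))  # a few hundred bytes
```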
The dense data uses ~40 MB of memory, while the dense representation with sparse dtypes uses only ~0.5 kB of memory!
And while you can import dataframes with the sparse=True keyword, the size seems to be reported inaccurately (both show the same size?), and we cannot inspect the data the way we can with a pandas MultiIndex + sparse dtype:
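(Again standing in for the missing screenshot, a sketch using Dataset.from_dataframe(..., sparse=True), which does exist in xarray; exactly how the size is reported may vary by version.)

```python
import numpy as np
import pandas as pd
import xarray as xr

index = pd.MultiIndex.from_product([range(1_000), range(1_000)], names=["x", "y"])
series = pd.Series(np.nan, index=index)
series.iloc[::50_000] = 1.0

# sparse=True unstacks the MultiIndex into a sparse.COO-backed variable.
ds = xr.Dataset.from_dataframe(series.to_frame("value"), sparse=True)
print(type(ds["value"].data))  # sparse.COO rather than a dense ndarray
print(ds["value"].nbytes)      # the size xarray reports; cf. the quirk above
```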
Besides, a lot of operations are not available on sparse xarray data variables (e.g. if I wanted to group by price level for ffill & downsampling):
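(The screenshot of the failing operation is not reproduced here. As a hedged illustration, one workaround is to densify the variable before running such operations; the densify helper below is hypothetical, not xarray API.)

```python
import xarray as xr

def densify(da: xr.DataArray) -> xr.DataArray:
    """Hypothetical helper: materialize a sparse-backed DataArray as dense."""
    return da.copy(data=da.data.todense())

# e.g. densify(ds["value"]).ffill("x") where the sparse-backed call fails
```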
So, it would be nice if xarray adopted pandas’ approach of unstacking sparse data.
In the end, you could extract all the non-NaN values and write them to a sparse storage format, such as TileDB sparse arrays. cc: @stavrospapadopoulos
Would it be possible for pd.{Series, DataFrame}.to_xarray() to automatically create a sparse DataArray, or to add a flag to to_xarray() that controls this? I have a very sparse dataframe, and every time I try to convert it to xarray I blow out my memory. Keeping it sparse, but logically as a DataArray, would be fantastic.
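(A sketch of the workaround available today, assuming the frame has a MultiIndex: route the conversion through Dataset.from_dataframe, which already accepts sparse=True, rather than to_xarray(). The toy frame below is a stand-in for the very sparse data described above.)

```python
import numpy as np
import pandas as pd
import xarray as xr

# Only observed (i, j) pairs are present in the frame; the full i x j
# tensor product that to_xarray() would densify has 40 x 40 cells here,
# and can easily blow up for real data.
ii = np.arange(0, 1_000, 25)  # 40 observed rows
index = pd.MultiIndex.from_arrays([ii, ii[::-1]], names=["i", "j"])
df = pd.DataFrame({"value": 1.0}, index=index)

ds = xr.Dataset.from_dataframe(df, sparse=True)  # unstacks without densifying
da = ds["value"]  # sparse-backed DataArray; missing cells are never materialized
```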