Stack + to_array before to_xarray is much faster than a simple to_xarray
I was seeing some slow performance around to_xarray() on MultiIndexed series, and found that unstacking one of the dimensions before running to_xarray(), and then restacking with to_array(), was ~30x faster. The time difference persists at larger data sizes.
To reproduce:
Create a series with a MultiIndex, ensuring the MultiIndex isn’t a simple product:
import numpy as np
import pandas as pd

s = pd.Series(
    np.random.rand(100000),
    index=pd.MultiIndex.from_product([
        list('abcdefhijk'),
        list('abcdefhijk'),
        pd.date_range('2000-01-01', periods=1000, freq='B'),
    ]))
cropped = s[::3]
cropped.index = pd.MultiIndex.from_tuples(cropped.index, names=list('xyz'))
cropped.head()
# x y z
# a a 2000-01-03 0.993989
# 2000-01-06 0.850518
# 2000-01-11 0.068944
# 2000-01-14 0.237197
# 2000-01-19 0.784254
# dtype: float64
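As a quick sanity check (my addition, not part of the original report), the cropped index covers only about a third of the full product of its levels, so to_xarray() has to fill in NaNs:

n_full = np.prod([len(lvl) for lvl in cropped.index.levels])
print(len(cropped), 'of', n_full, 'possible rows')
# 33334 of 100000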
Two approaches for getting this into xarray:
1 - Simple .to_xarray():
current_version = cropped.to_xarray()
<xarray.DataArray (x: 10, y: 10, z: 1000)>
array([[[0.993989, nan, ..., nan, 0.721663],
[ nan, nan, ..., 0.58224 , nan],
...,
[ nan, 0.369382, ..., nan, nan],
[0.98558 , nan, ..., nan, 0.403732]],
[[ nan, nan, ..., 0.493711, nan],
[ nan, 0.126761, ..., nan, nan],
...,
[0.976758, nan, ..., nan, 0.816612],
[ nan, nan, ..., 0.982128, nan]],
...,
[[ nan, 0.971525, ..., nan, nan],
[0.146774, nan, ..., nan, 0.419806],
...,
[ nan, nan, ..., 0.700764, nan],
[ nan, 0.502058, ..., nan, nan]],
[[0.246768, nan, ..., nan, 0.079266],
[ nan, nan, ..., 0.802297, nan],
...,
[ nan, 0.636698, ..., nan, nan],
[0.025195, nan, ..., nan, 0.629305]]])
Coordinates:
* x (x) object 'a' 'b' 'c' 'd' 'e' 'f' 'h' 'i' 'j' 'k'
* y (y) object 'a' 'b' 'c' 'd' 'e' 'f' 'h' 'i' 'j' 'k'
* z (z) datetime64[ns] 2000-01-03 2000-01-04 ... 2003-10-30 2003-10-31
This takes 536 ms
2 - Unstack in pandas first, and then use to_array to do the equivalent of a restack:
proposed_version = (
    cropped
    .unstack('y')
    .to_xarray()
    .to_array('y')
)
This takes 17.3 ms
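For reference, a rough way to reproduce the comparison with timeit (a sketch, not the original benchmark commands; exact numbers will vary by machine and by pandas/xarray version):

import timeit

# Average over a few runs of each approach, in seconds
t_current = timeit.timeit(lambda: cropped.to_xarray(), number=5) / 5
t_proposed = timeit.timeit(
    lambda: cropped.unstack('y').to_xarray().to_array('y'), number=5) / 5
print(f"to_xarray(): {t_current * 1e3:.0f} ms   unstack + to_array: {t_proposed * 1e3:.0f} ms")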
To confirm these are identical:
proposed_version_adj = (
    proposed_version
    .assign_coords(y=proposed_version['y'].astype(object))
    .transpose(*current_version.dims)
)
proposed_version_adj.equals(current_version)
# True
Problem description
A default operation is much slower than a (potentially) equivalent operation that’s not the default.
I need to look more at what's causing the slowdown. I think it's to do with the .reindex(full_idx) call, but I'm unclear why the alternative route is so much faster, and whether there's a fix we could make so the default path is fast too.
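To illustrate the suspicion, here is a sketch of what I believe the slow path is doing internally (an assumption about the internals, not the actual xarray code): reindexing the sparse series onto the full cartesian product of its MultiIndex levels before reshaping into a dense block.

# Suspected bottleneck: densify the sparse series onto the full product
# of its index levels (10 * 10 * 1000 rows), then reshape.
full_idx = pd.MultiIndex.from_product(cropped.index.levels,
                                      names=cropped.index.names)
dense = cropped.reindex(full_idx)           # ~100k rows, NaN-padded
arr = dense.values.reshape(10, 10, 1000)    # dense (x, y, z) block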
Output of xr.show_versions()
xarray: 0.10.9
pandas: 0.23.4
numpy: 1.15.2
scipy: 1.1.0
netCDF4: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
PseudonetCDF: None
rasterio: None
iris: None
bottleneck: 1.2.1
cyordereddict: None
dask: None
distributed: None
matplotlib: 2.2.3
cartopy: 0.16.0
seaborn: 0.9.0
setuptools: 40.4.3
pip: 18.0
conda: None
pytest: 3.8.1
IPython: 5.8.0
sphinx: None
@tqfjo unrelated. You're comparing the creation of a dataset with 2 variables against the creation of one with 3000. Unsurprisingly, the latter will take ~1500x as long. If your dataset doesn't functionally contain 3000 variables but just a single two-dimensional variable, use xarray.DataArray(ds).

Very good news! Thanks for implementing it!
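For what it's worth, a minimal sketch of the pattern that comment is suggesting. I'm not certain xarray.DataArray(ds) and Dataset.to_array() behave identically in every case, but to_array() is the documented way to collapse a Dataset's variables into a single DataArray dimension:

import numpy as np
import pandas as pd
import xarray as xr

# A wide DataFrame naively converted gives one Dataset variable per column;
# collapsing the variables into a dimension yields a single 2-D DataArray.
df = pd.DataFrame(np.random.rand(10, 3000))
ds = df.to_xarray()           # Dataset with 3000 one-dimensional variables
da = ds.to_array('column')    # single DataArray of shape (3000, 10)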