Supporting out-of-core computation/indexing for very large indexes
(Follow-up of the discussion here: https://github.com/pydata/xarray/pull/1024#issuecomment-258524115.)
xarray + dask.array successfully enable out-of-core computation for very large variables that don't fit in memory. One current limitation is that the indexes of a Dataset or DataArray, which rely on pandas.Index, are still fully loaded into memory (and they will even be loaded eagerly after #1024). In many cases this is not a problem, as 1-dimensional indexes are usually much smaller than n-dimensional variables or coordinates.
However, this may be problematic in some specific cases where we have to deal with very large indexes. For example, big unstructured meshes often have (x, y, z) coordinates arranged as 1-d arrays whose length equals the number of nodes, which can be very large! (See, e.g., the ugrid conventions.)
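To make the use case concrete, here is a hypothetical example of such a dataset (made-up names and sizes): the data variable is dask-backed and lazy, but the 1-d node coordinates end up as plain in-memory arrays, and any index built on them would be in-memory too.

```python
# Hypothetical unstructured-mesh dataset: the 2-d data variable is lazy
# (dask), but the 1-d (x, y, z) node coordinates are in-memory arrays.
import dask.array as da
import numpy as np
import xarray as xr

n_nodes = 5_000_000  # number of mesh nodes (can be much larger in practice)
ds = xr.Dataset(
    {"temperature": (("time", "node"),
                     da.zeros((100, n_nodes), chunks=(10, 1_000_000)))},
    coords={
        "x": ("node", np.random.rand(n_nodes)),
        "y": ("node", np.random.rand(n_nodes)),
        "z": ("node", np.random.rand(n_nodes)),
    },
)
```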
It would be very nice if xarray could also help for these use cases. Therefore I’m wondering if (and how) out-of-core support can be extended to indexes and indexing.
I’ve briefly looked at the documentation on dask.dataframe, and a first naive approach I have in mind would be to allow partitioning an index into multiple, contiguous indexes. For label-based indexing, we might for example map indexing.convert_label_indexer to each partition and combine the returned indexers.
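As a rough, hypothetical illustration of that idea (plain pandas/numpy rather than xarray's convert_label_indexer, and assuming a unique index): split the index into contiguous chunks, look labels up per chunk, and shift the positional results back to global positions.

```python
# Hypothetical per-partition label lookup (not xarray API):
# query each contiguous chunk of the index independently and
# offset the positional results into the global coordinate space.
import numpy as np
import pandas as pd

def partitioned_get_indexer(index, labels, n_partitions):
    """Look up `labels` in `index` one partition at a time."""
    chunk_size = -(-len(index) // n_partitions)  # ceiling division
    result = np.full(len(labels), -1, dtype=np.intp)
    for start in range(0, len(index), chunk_size):
        part = index[start:start + chunk_size]
        local = part.get_indexer(labels)          # -1 where not found
        found = local >= 0
        result[found] = local[found] + start      # shift to global positions
    return result

idx = pd.Index(np.arange(1_000_000))
print(partitioned_get_indexer(idx, [3, 999_999, 42], n_partitions=8))
```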
My knowledge of dask is very limited, though. So I’ve no doubt that this suggestion is very simplistic and not very efficient, or that there are better approaches. I’m also certainly missing other issues not directly related to indexing.
Any thoughts?
Top GitHub Comments
For unstructured meshes of points, pandas.MultiIndex is not the right abstraction.

Suppose you have a (very long) list of sorted points (x, y, z) in a multi-index. You can efficiently query within fixed bounds along x by doing binary search. But for queries in y and z, you cannot do any better than scanning the entire list. Moreover, pandas.MultiIndex factorizes each level into unique values, which is a complete waste on an unstructured grid where few coordinates overlap.

For unstructured meshes, you need something like a KDTree (see the discussion in https://github.com/pydata/xarray/issues/475), ideally with nearby points in space stored in contiguous array chunks.
I would start by trying to get an in-memory KDTree working, and then switch to something out of core only when/if necessary. For example, SciPy’s cKDTree can load 1e7 points in 3 dimensions in only a few seconds:
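The snippet referred to here isn’t preserved in this copy of the thread; a minimal sketch along those lines (random points, scipy assumed installed, timing varies by machine) would be:

```python
# Rough sketch of the timing experiment (not the original snippet):
# build a cKDTree over 1e7 random 3-d points and time the construction.
import time
import numpy as np
from scipy.spatial import cKDTree

points = np.random.rand(10_000_000, 3)

start = time.time()
tree = cKDTree(points)
print("built tree in %.1f s" % (time.time() - start))

# example nearest-neighbour query against the tree
dist, idx = tree.query([0.5, 0.5, 0.5], k=3)
print(dist, idx)
```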
That might be good enough.
Should this and https://github.com/pydata/xarray/issues/1650 be consolidated into a single issue? I think that they’re duplicates of each other.