Supporting out-of-core computation/indexing for very large indexes

(Follow-up of discussion here https://github.com/pydata/xarray/pull/1024#issuecomment-258524115).

xarray + dask.array successfully enable out-of-core computation for very large variables that don't fit in memory. One current limitation is that the indexes of a Dataset or DataArray, which rely on pandas.Index, are still fully loaded into memory (and they will soon be loaded eagerly, after #1024). In many cases this is not a problem, as 1-dimensional indexes are usually much smaller than n-dimensional variables or coordinates.
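To make the limitation concrete, here is a minimal sketch (the array size and the 'node' dimension name are arbitrary): the data variable is lazy, but its index is an ordinary in-memory pandas.Index.

import numpy as np
import dask.array as da
import xarray as xr

# Lazy, chunked data variable: nothing is loaded into memory yet.
data = da.random.random((10_000_000,), chunks=1_000_000)
arr = xr.DataArray(data, dims='node',
                   coords={'node': np.arange(10_000_000)})

# The index backing the 'node' coordinate, however, is a plain pandas
# Index that lives fully in memory, no matter how long the dimension is.
print(type(arr.indexes['node']))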

However, this may be problematic in specific cases where we have to deal with very large indexes. For example, big unstructured meshes often have coordinates (x, y, z) arranged as 1-d arrays whose length equals the number of nodes, which can be very large! (See, e.g., the ugrid conventions.)

It would be very nice if xarray could also help for these use cases. Therefore I’m wondering if (and how) out-of-core support can be extended to indexes and indexing.

I’ve briefly looked at the documentation on dask.dataframe, and a first naive approach I have in mind would be to allow partitioning an index into multiple contiguous indexes. For label-based indexing, we might for example map indexing.convert_label_indexer over each partition and combine the returned indexers, as in the sketch below.
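A rough sketch of the idea (partitioned_get_indexer is a hypothetical helper made up for illustration; it uses plain pandas.Index.get_indexer rather than xarray's convert_label_indexer, and only handles exact-label lookup):

import numpy as np
import pandas as pd

def partitioned_get_indexer(partitions, labels):
    # Look up the labels in each partition independently, then shift the
    # positional results by each partition's offset into the full index.
    labels = np.asarray(labels)
    result = np.full(labels.shape, -1, dtype=np.intp)
    offset = 0
    for idx in partitions:
        pos = idx.get_indexer(labels)  # -1 where a label is absent
        found = pos >= 0
        result[found] = pos[found] + offset
        offset += len(idx)
    return result

# Two contiguous partitions standing in for one index too large for memory.
partitions = [pd.Index(np.arange(0, 1000)), pd.Index(np.arange(1000, 2000))]
partitioned_get_indexer(partitions, [5, 1500])  # -> array([   5, 1500])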

My knowledge of dask is very limited, though. So I’ve no doubt that this suggestion is very simplistic and not very efficient, or that there are better approaches. I’m also certainly missing other issues not directly related to indexing.

Any thoughts?

cc @shoyer @mrocklin

Top GitHub Comments

shoyer commented, Nov 8, 2016

For unstructured meshes of points, pandas.MultiIndex is not the right abstraction.

Suppose you have a (very long) list of points (x, y, z), sorted lexicographically in a multi-index. You can efficiently query within fixed bounds along x by doing binary search, but for queries in y and z you cannot do any better than scanning the entire list. Moreover, pandas.MultiIndex factorizes each level into unique values, which is a complete waste on an unstructured grid where few coordinates overlap.
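A small illustration of that asymmetry, with made-up data rather than pandas internals:

import numpy as np

# One million random points sorted lexicographically by (x, y, z),
# mimicking the layout of a sorted MultiIndex.
pts = np.random.rand(1_000_000, 3)
pts = pts[np.lexsort((pts[:, 2], pts[:, 1], pts[:, 0]))]

# Bounds on x: binary search locates the matching slice in O(log n).
lo, hi = np.searchsorted(pts[:, 0], [0.25, 0.30])
in_x = pts[lo:hi]

# Bounds on y alone: the sort order is no help; this is a full O(n) scan.
in_y = pts[(pts[:, 1] >= 0.25) & (pts[:, 1] < 0.30)]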

For unstructured meshes, you need something like a KDTree (see the discussion in https://github.com/pydata/xarray/issues/475), ideally with nearby points in space stored in contiguous array chunks.

I would start with trying to get an in-memory KDTree working, and then switch to something out-of-core only when/if necessary. For example, SciPy’s cKDTree can build a tree over 1e7 points in 3 dimensions in only a few seconds:

import numpy as np
import scipy.spatial

x = np.random.rand(int(1e7), 3)
%time tree = scipy.spatial.cKDTree(x, leafsize=100)
# CPU times: user 2.58 s, sys: 0 ns, total: 2.58 s
# Wall time: 2.55 s

That might be good enough.
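And once built, lookups against the tree (continuing from the snippet above) are cheap, e.g. a nearest-neighbour query:

# Distances to and indices of the 5 nearest points to an arbitrary location.
dist, idx = tree.query([0.5, 0.5, 0.5], k=5)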

TomAugspurger commented, Jan 26, 2021

Should this and https://github.com/pydata/xarray/issues/1650 be consolidated into a single issue? I think they’re duplicates of each other.
