Supporting range queries

I'm not sure yet whether this makes sense, and I'm still thinking it through myself, so treat this as a starting point for discussion. It has been coming up in a few places, though.

Is there a way with fsspec to perform range queries? Or could there be?

Basically, I'm thinking about this from the Zarr side, where we are increasingly interested in being able to select portions of chunks. Range queries would be useful for reading just those portions.

cc @joshmoore @rabernat

Some related discussion in these issues:

Issue Analytics

  • State: open
  • Created 2 years ago
  • Reactions: 1
  • Comments: 12 (6 by maintainers)

Top GitHub Comments

1 reaction
rjzamora commented, Sep 24, 2021

“it sounds like you have done similar things with tabular data loading on GCP recently, IIUC”

Yes - cudf#9265 was recently merged as a temporary workaround for the fact that cudf cannot seek/read from an fsspec file-like object. Before that PR was merged, cudf would always read the entire remote file into a host memory buffer, even for partial IO. The “simple” workaround was to transfer only the necessary byte ranges into the local buffer (in parallel). Martin’s cat_ranges PR was not used in the cudf change, but it probably will be in the near future. The new cat_ranges API makes it easy to efficiently transfer a specific set of byte ranges with a single line of code. The only logic that the downstream library needs to worry about is the calculation of the specific byte ranges to pass to cat_ranges.
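As a minimal sketch of the cat_ranges call described above (not the code from the cudf change), the following fetches three byte ranges from one object. It assumes fsspec's in-memory filesystem as a stand-in for a remote store; the path and the offsets are made up for illustration.

```python
import fsspec

# Assumption: the "memory" filesystem stands in for a remote store such as GCS/S3.
fs = fsspec.filesystem("memory")
path = "/demo-chunk.bin"  # hypothetical key, for illustration only
fs.pipe(path, bytes(range(100)))  # write 100 bytes with values 0..99

# Byte ranges are half-open: [start, end). These offsets are illustrative.
starts = [0, 10, 90]
ends = [4, 12, 100]

# One call transfers all requested ranges; the result is one bytes object per range.
parts = fs.cat_ranges([path] * len(starts), starts, ends)
```

The paths argument is a list with one entry per range, so the same path is repeated here. On a real remote filesystem the individual transfers may be issued concurrently, depending on the backend.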

If you are working with a library that is able to read/seek from an fsspec file-like object, then the best approach is likely to gather the known byte ranges with cat_ranges, and then to open the remote file with the new “parts” caching strategy. Note that I plan to add this optimization to Dask for read_parquet and byte_range-based read_csv.

1 reaction
martindurant commented, Sep 22, 2021

Related, @jakirkham: Zarr cannot currently read portions of a key, specifically for the case where the storage target is not compressed. I believe it can read selective blosc blocks (and zstd, in particular, would be very doable). Such functionality would be very helpful in a number of access patterns.
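To make the uncompressed case concrete, here is a rough sketch of the byte-range arithmetic a downstream library would need. It is not Zarr's API: block_byte_ranges is a hypothetical helper, and it assumes a 2-D chunk stored uncompressed in C (row-major) order with a fixed item size.

```python
def block_byte_ranges(chunk_shape, itemsize, rows, cols):
    """Byte ranges for a rectangular block of an uncompressed, C-order 2-D chunk.

    Hypothetical helper for illustration only.
    rows and cols are half-open (start, stop) index pairs.
    Returns (starts, ends) with one half-open byte range per selected row.
    """
    n_rows, n_cols = chunk_shape
    r0, r1 = rows
    c0, c1 = cols
    if not (0 <= r0 <= r1 <= n_rows and 0 <= c0 <= c1 <= n_cols):
        raise ValueError("selection lies outside the chunk")

    row_bytes = n_cols * itemsize  # bytes occupied by one full row
    # Each selected row contributes one contiguous run of bytes.
    starts = [r * row_bytes + c0 * itemsize for r in range(r0, r1)]
    ends = [r * row_bytes + c1 * itemsize for r in range(r0, r1)]
    return starts, ends


# Rows 1..2 and columns 2..3 of a 4 x 5 chunk of 8-byte items.
starts, ends = block_byte_ranges((4, 5), 8, rows=(1, 3), cols=(2, 4))
```

The resulting starts and ends lists have the shape that a call such as cat_ranges expects. When full rows are selected, the per-row ranges are adjacent and could be merged into a single range before fetching.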

@rabernat: I don’t know exactly how h5py achieves this either, but I assume it must be compiled against the Python interpreter and asks it to call the (dynamic) methods on the objects passed. A similar issue in rasterio: https://github.com/mapbox/rasterio/pull/2141
