concatenating files

I’m not sure if this should go here or to the main fsspec repo.

I’d like to use the reference filesystem to concatenate pieces. That is, for one reference, I’d like to specify a list of pieces to join together to make up a new file, instead of having just a single pointer. This could look roughly like:

"refs": {
      "key0": ["data", ["http://target_url", 10000, 100]],
      "key1": [["http://target_url", 10000, 100], ["http://{{u}}", 10000, 100]],
    }

etc… Using that method, one could completely rearrange existing files. In my current application, I’d like to join existing chunks of uncompressed netCDF file into a single larger chunk to be used within zarr.
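To make the proposed semantics concrete, here is a minimal sketch of how such a list of pieces could be resolved into one byte string. It is not part of fsspec: `resolve_parts`, `fake_fetch` and `fake_store` are hypothetical names, the range fetcher is passed in so the example runs without network access, and template expansion such as `{{u}}` is not handled.

```python
from typing import Callable, Sequence, Union

# A piece is either inline data (str) or a [url, offset, length] reference.
Part = Union[str, Sequence]


def resolve_parts(
    parts: Sequence[Part],
    fetch_range: Callable[[str, int, int], bytes],
) -> bytes:
    """Join inline data and referenced byte ranges, in order (illustrative only)."""
    chunks = []
    for part in parts:
        if isinstance(part, str):
            # Inline data is stored as text in the reference file.
            chunks.append(part.encode())
        else:
            url, offset, length = part
            chunks.append(fetch_range(url, offset, length))
    return b"".join(chunks)


# Stand-in for a remote object store (assumption: real code would issue range requests).
fake_store = {"http://target_url": b"0123456789"}


def fake_fetch(url: str, offset: int, length: int) -> bytes:
    return fake_store[url][offset:offset + length]


joined = resolve_parts(
    ["ab", ["http://target_url", 2, 3], ["http://target_url", 7, 2]],
    fake_fetch,
)
print(joined)  # b'ab23478'
```

Because the pieces are joined strictly in list order, rearranging the list rearranges the resulting file, which is what joining several uncompressed netCDF chunks into one larger zarr chunk would need.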

A potential issue is that the following would become ambiguous:

"refs": {
      "key0": ["https://test"],
    }

This could refer either to a single piece of raw data containing the text “https://test” or to a reference to the entire object behind the link. However, it should be possible to disambiguate this by requiring that single-element raw data blocks always be written without the list.
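As a sketch of that rule, the top-level shape of an entry could be classified as follows. The function name and the returned labels are made up for illustration, and treating a three-element `[str, int, int]` list as the existing single byte-range reference is an assumption about how the extension would coexist with the current format.

```python
def classify(entry):
    """Classify a reference entry by its shape (hypothetical helper, not fsspec API)."""
    if isinstance(entry, str):
        # Bare string: raw inline data, never wrapped in a list.
        return "inline"
    if len(entry) == 1 and isinstance(entry[0], str):
        # One-element list: the whole object behind the URL.
        return "whole-object"
    if (
        len(entry) == 3
        and isinstance(entry[0], str)
        and isinstance(entry[1], int)
        and isinstance(entry[2], int)
    ):
        # Existing single-pointer form: [url, offset, length].
        return "byte-range"
    # Anything else is read as a list of pieces to concatenate.
    return "concatenation"


print(classify("https://test"))    # inline
print(classify(["https://test"]))  # whole-object
```

Under this rule a one-element list is never raw data, so the ambiguous example above always resolves to a reference to the entire object.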

Top GitHub Comments

d70-t commented, Jul 5, 2022

@martindurant I’ve experimented a bit using some ideas from above: https://github.com/d70-t/preffs It’s a very rudimentary and likely broken implementation of a parquet-based reference filesystem (with eager parquet loading, though). It supports references, inline data, and concatenation. I used it to bring a 6.8 GB reference JSON (about 60 million entries) down to 360 MB of parquet. The loading time went down from over 20 minutes for JSON to less than 1 minute for parquet. (Those numbers are all with v0 references, as jinja2 slowed things down too much to wait for it.)

I’ll likely not have the time to work on that project, but it seems to be a very useful direction to go in (especially if lazy loading is on the horizon), so I thought I’d share it. Maybe someone else can find some time?

d70-t commented, Mar 30, 2022

Note that the non-key data and inlined keys would fit naturally into parquet’s user key-value metadata store (so long as it’s relatively small).

Perfect! Yes, I’d suspect that for most use cases, the inlined keys should be a tiny fraction of the keys. Probably it will be best to just do a lookup for both: first into a small kv-store with inlined keys and then into the large table of references.
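A rough sketch of that two-step lookup, using plain dicts as stand-ins: in a real implementation the small store might live in parquet's key-value metadata and the large table would be the parquet data itself. The class and attribute names are hypothetical.

```python
class TwoLevelRefs:
    """Look up inlined keys first, then fall back to the large reference table."""

    def __init__(self, inlined, references):
        self.inlined = inlined        # small: key -> raw bytes
        self.references = references  # large: key -> (url, offset, length)

    def lookup(self, key):
        # Cheap check against the small inline store first.
        if key in self.inlined:
            return ("inline", self.inlined[key])
        # Otherwise consult the big table; raises KeyError if the key is unknown.
        url, offset, length = self.references[key]
        return ("reference", (url, offset, length))


refs = TwoLevelRefs(
    inlined={".zgroup": b'{"zarr_format": 2}'},
    references={"var/0.0": ("http://target_url", 10000, 100)},
)
print(refs.lookup(".zgroup"))
print(refs.lookup("var/0.0"))
```

Checking the small store first keeps the common metadata keys fast, and the large table is only touched for actual chunk references.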