concatenating files
I’m not sure if this should go here or to the main fsspec repo.
I’d like to use the reference filesystem to concatenate pieces. That is, for one reference, I’d like to specify a list of pieces to join together to make up a new file, instead of having just a single pointer. This could roughly look like:
"refs": {
"key0": ["data", ["http://target_url", 10000, 100]],
"key1": [["http://target_url", 10000, 100], ["http://{{u}}", 10000, 100]],
}
etc… Using that method, one could completely rearrange existing files. In my current application, I’d like to join existing chunks of an uncompressed netCDF file into a single larger chunk to be used within zarr.
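To make the idea concrete, here is a minimal sketch of how such a list of pieces could be resolved into one buffer. It is not part of fsspec: `cat_parts`, `fake_fetch` and the in-memory `BLOBS` store are hypothetical names used for illustration, and the sketch assumes each piece is either a plain string (inline data) or a `[url, offset, length]` triple.

```python
from typing import Callable, List, Union

# A piece is either inline text or a [url, offset, length] reference (assumption).
Part = Union[str, list]


def cat_parts(parts: List[Part], fetch: Callable[[str, int, int], bytes]) -> bytes:
    """Join inline strings and byte-range references into a single buffer."""
    out = bytearray()
    for part in parts:
        if isinstance(part, str):
            # Inline data is stored directly in the reference set.
            out += part.encode()
        else:
            url, offset, length = part
            out += fetch(url, offset, length)
    return bytes(out)


# Hypothetical in-memory stand-in for remote storage, so the sketch runs offline.
BLOBS = {"http://target_url": bytes(range(256)) * 64}


def fake_fetch(url: str, offset: int, length: int) -> bytes:
    return BLOBS[url][offset:offset + length]


refs = {
    "key0": ["data", ["http://target_url", 10000, 100]],
    "key1": [["http://target_url", 10000, 100], ["http://target_url", 200, 50]],
}

chunk0 = cat_parts(refs["key0"], fake_fetch)  # 4 inline bytes + 100 fetched bytes
chunk1 = cat_parts(refs["key1"], fake_fetch)  # 100 + 50 fetched bytes
```

In a real implementation, `fetch` would presumably be a ranged read against the target filesystem; that part is left out here.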
A potential issue is that the following would become ambiguous:
"refs": {
"key0": ["https://test"],
}
This could refer either to a single piece of raw data containing the text “https://test” or to a reference to the entire object behind the link. However, it should be possible to disambiguate this by defining that single-element raw data blocks must always be written without the list.
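The disambiguation rule can be sketched as a small classifier. This is only an illustration of the rule described above: `classify` is a hypothetical helper, and it assumes that the existing `[url]` and `[url, offset, length]` forms remain valid alongside the proposed list-of-pieces form.

```python
def classify(entry):
    """Return the kind of a single reference value.

    Kinds: 'inline', 'whole_object', 'byte_range' or 'concat'.
    """
    if isinstance(entry, str):
        # Raw data is always written as a bare string, never as a one-element list.
        return "inline"
    if len(entry) == 1 and isinstance(entry[0], str):
        # By the rule above, a one-element list always points at an entire object.
        return "whole_object"
    if (
        len(entry) == 3
        and isinstance(entry[0], str)
        and all(isinstance(x, int) for x in entry[1:])
    ):
        # Assumed to stay valid: a single [url, offset, length] pointer.
        return "byte_range"
    # Anything else is treated as a list of pieces to concatenate.
    return "concat"


kinds = [
    classify("https://test"),
    classify(["https://test"]),
    classify(["http://target_url", 10000, 100]),
    classify(["data", ["http://target_url", 10000, 100]]),
]
```

Because pieces are only ever strings or lists, a three-element list whose second and third items are integers cannot be confused with a concatenation.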
Top GitHub Comments
@martindurant I’ve experimented a bit using some ideas from above: https://github.com/d70-t/preffs It’s a very rudimentary and likely broken implementation of a parquet-based reference filesystem (with eager parquet loading, though). It supports references, inline data and concatenation. I used it to bring a 6.8 GB reference JSON (about 60 million entries) down to 360 MB of parquet. The loading time went down from over 20 minutes for JSON to less than 1 minute for parquet. (Those numbers are all with v0 references, as jinja2 slowed things down too much to wait for.)
I likely won’t have the time to work on that project, but it seems to be a very useful direction to go in (especially if lazy loading is on the horizon), so I thought I’d share it. Maybe someone else can find some time?
Perfect! Yes, I’d suspect that for most use cases, the inlined keys should be a tiny fraction of the keys. Probably it will be best to just do a lookup for both: first into a small kv-store with inlined keys and then into the large table of references.
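A rough sketch of that two-step lookup follows. Everything in it is an assumption for illustration: `TwoTierRefs` is a hypothetical class, the small store is a plain dict, and the "large table" is modelled as a sorted key list with a parallel value list, standing in for whatever columnar format is actually used.

```python
import bisect


class TwoTierRefs:
    """Look up a key in a small inline store first, then in a large sorted table."""

    def __init__(self, inlined, ref_keys, ref_values):
        self.inlined = inlined        # small dict: key -> inline bytes
        self.ref_keys = ref_keys      # sorted list of keys
        self.ref_values = ref_values  # parallel list of (url, offset, length)

    def lookup(self, key):
        # Step 1: the small kv-store holding the few inlined keys.
        if key in self.inlined:
            return ("inline", self.inlined[key])
        # Step 2: binary search in the large table of references.
        i = bisect.bisect_left(self.ref_keys, key)
        if i < len(self.ref_keys) and self.ref_keys[i] == key:
            return ("ref", self.ref_values[i])
        raise KeyError(key)


store = TwoTierRefs(
    inlined={".zgroup": b'{"zarr_format": 2}'},
    ref_keys=["temp/0.0", "temp/0.1"],
    ref_values=[
        ("http://target_url", 10000, 100),
        ("http://target_url", 10100, 100),
    ],
)

meta = store.lookup(".zgroup")
chunk = store.lookup("temp/0.1")
```

Checking the small store first keeps the common metadata lookups cheap, while the large table only needs to support a keyed search.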