concatenating files

I’m not sure if this should go here or to the main fsspec repo.

I’d like to use the reference filesystem to concatenate pieces. That is, for one reference, I’d like to specify a list of things to join together to make up a new file instead of having just a single pointer. This could look roughly like:

"refs": {
  "key0": ["data", ["http://target_url", 10000, 100]],
  "key1": [["http://target_url", 10000, 100], ["http://{{u}}", 10000, 100]]
}

etc… Using that method, one could completely rearrange existing files. In my current application, I’d like to join existing chunks of an uncompressed netCDF file into a single larger chunk to be used within zarr.
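A minimal sketch of how such a multi-part reference could be resolved, assuming each part is either an inline string or a `[url, offset, length]` triple as in the example above. The names `resolve_parts` and `make_memory_fetch` are hypothetical and are not part of fsspec; the fetch function is passed in so that the sketch does not depend on any particular backend, and inline strings are assumed to be plain UTF-8 text.

```python
from typing import Callable, Sequence, Union

# A part is either inline text or a [url, offset, length] triple (assumption
# based on the proposed "refs" layout; not an existing fsspec type).
Part = Union[str, Sequence]
Fetch = Callable[[str, int, int], bytes]


def resolve_parts(parts: Sequence[Part], fetch: Fetch) -> bytes:
    """Join inline data and fetched byte ranges into one byte string."""
    pieces = []
    for part in parts:
        if isinstance(part, str):
            # Inline data: assumed to be UTF-8 text in this sketch.
            pieces.append(part.encode("utf-8"))
        else:
            url, offset, length = part
            pieces.append(fetch(url, offset, length))
    return b"".join(pieces)


def make_memory_fetch(blobs: dict) -> Fetch:
    """Hypothetical stand-in for a real range request: slices in-memory bytes."""
    def fetch(url: str, offset: int, length: int) -> bytes:
        return blobs[url][offset:offset + length]
    return fetch


if __name__ == "__main__":
    fetch = make_memory_fetch({"http://target_url": b"0123456789"})
    print(resolve_parts(["data", ["http://target_url", 2, 3]], fetch))
```

In a real filesystem the fetch callable would issue a ranged read against the target; here it only slices a dictionary of byte strings so the joining logic can be tried in isolation.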


A potential issue is that the following would become ambiguous:

"refs": {
  "key0": ["https://test"]
}

This could refer either to a single piece of raw data containing the text “https://test” or to a reference to the entire object behind the link. However, it should be possible to disambiguate this by defining that single-element raw data blocks must always be written without the list.
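One possible reading of that rule, sketched as a classifier. This is an assumption about how the proposal could be interpreted, not an existing fsspec behaviour, and the function name `classify` and the returned labels are hypothetical: a bare string is inline data, a one-element list holding a string is a whole-object reference, a `[url, offset, length]` list is a byte range, and anything else is treated as a list of parts to concatenate.

```python
def classify(entry):
    """Classify a reference entry under the proposed disambiguation rule."""
    if isinstance(entry, str):
        # Single raw data block: always written without the list.
        return "inline"
    if len(entry) == 1 and isinstance(entry[0], str):
        # One-element list: reference to the entire object behind the URL.
        return "whole_object"
    if (
        len(entry) == 3
        and isinstance(entry[0], str)
        and isinstance(entry[1], int)
        and isinstance(entry[2], int)
    ):
        # [url, offset, length]: a single byte range.
        return "byte_range"
    # Everything else is taken to be a list of parts to join.
    return "concatenation"


if __name__ == "__main__":
    for entry in ("https://test", ["https://test"],
                  ["http://target_url", 10000, 100],
                  ["data", ["http://target_url", 10000, 100]]):
        print(entry, "->", classify(entry))
```

Under this reading, `["https://test"]` is never raw data, which is exactly what removes the ambiguity.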

Issue Analytics

  • State: open
  • Created 2 years ago
  • Reactions: 2
  • Comments: 21 (8 by maintainers)

Top GitHub Comments

1 reaction
d70-t commented, Jul 5, 2022

@martindurant I’ve experimented a bit with some of the ideas from above (https://github.com/d70-t/preffs). It’s a very rudimentary and likely broken implementation of a parquet-based reference filesystem (with eager parquet loading, though). It supports references, inline data, and concatenation. I used it to shrink a 6.8 GB reference JSON (about 60 million entries) to a 360 MB parquet file. The loading time went from over 20 minutes for JSON to less than 1 minute for parquet. (Those numbers are all with v0 references, as jinja2 slowed things down too much to wait for it.)

I likely won’t have the time to work on that project, but it seems to be a very useful direction to go in (especially if lazy loading is on the horizon), so I thought I’d share it. Maybe someone else can find some time?

0 reactions
d70-t commented, Mar 30, 2022

Note that the non-key data and inlined keys would fit naturally into parquet’s user key-value metadata store (so long as it’s relatively small).

Perfect! Yes, I’d suspect that for most use cases, the inlined keys will be a tiny fraction of the keys. It will probably be best to just do a lookup in both: first in a small kv-store with the inlined keys, and then in the large table of references.
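The two-stage lookup described above could be sketched as follows. Plain dictionaries stand in for both the small key-value store and the large reference table, which is an assumption made for illustration; the function name `lookup` and the returned tags are hypothetical.

```python
def lookup(key, inlined, references):
    """Check the small store of inlined keys first, then the large table."""
    if key in inlined:
        return ("inline", inlined[key])
    # Raises KeyError if the key is in neither store.
    return ("reference", references[key])


if __name__ == "__main__":
    # Small store: keys whose data is embedded directly.
    inlined = {".zgroup": '{"zarr_format": 2}'}
    # Large table: keys that point at byte ranges in other files.
    references = {"var/0.0": ["http://target_url", 10000, 100]}

    print(lookup(".zgroup", inlined, references))
    print(lookup("var/0.0", inlined, references))
```

Because the inlined store is expected to be small, checking it first adds little cost compared with the lookup in the large table.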
