Hello!

I want to propose adding a new Zarr store type for the case where all array chunks are located in a single binary file. A prototype implementation, named file chunk store, is described in this Medium post. In this approach, Zarr metadata (.zgroup, .zarray, .zattrs, or .zmetadata) are stored in one of the current Zarr store types, while the array chunks are in a binary file. The file chunk store translates array chunk keys into file seek and read operations and therefore provides only read access to the chunk data.

The file chunk store requires a mapping between array chunk keys and their file locations. The prototype implementation stores this information for every Zarr array in JSON files named .zchunkstore. An example is below:

   {
    "BEAM0001/tx_pulseflag/0": {
        "offset": 94854560,
        "size": 120
    },
    "BEAM0001/tx_pulseflag/1": {
        "offset": 94854680,
        "size": 120
    },
    "BEAM0001/tx_pulseflag/2": {
        "offset": 94854800,
        "size": 120
    },
    "BEAM0001/tx_pulseflag/3": {
        "offset": 94854920,
        "size": 120
    },
    "BEAM0001/tx_pulseflag/4": {
        "offset": 96634038,
        "size": 120
    },
    "BEAM0001/tx_pulseflag/5": {
        "offset": 96634158,
        "size": 123
    },
    "source": {
        "array_name": "/BEAM0001/tx_pulseflag",
        "uri": "https://e4ftl01.cr.usgs.gov/GEDI/GEDI01_B.001/2019.05.26/GEDI01_B_2019146164739_O02560_T04067_02_003_01.h5"
    }
}

An array chunk's file location is described by its starting byte in the file (offset) and the number of bytes to read (size). Also included is information about the source file (source) to enable verification of chunk data provenance. The file chunk store prototype uses file-like Python objects, delegating to users the responsibility of arranging access to the correct files.
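To make the key-to-bytes translation concrete, here is a minimal read-only sketch of such a store. The class name, constructor, and index format are my assumptions based on the .zchunkstore example above, not the prototype's actual code:

```python
class FileChunkStore:
    """Read-only sketch of a file chunk store.

    ``chunk_index`` maps Zarr chunk keys (e.g. "BEAM0001/tx_pulseflag/0")
    to {"offset": ..., "size": ...} entries, as in the .zchunkstore
    example above. ``fileobj`` is any seekable binary file-like object.
    """

    def __init__(self, chunk_index, fileobj):
        self._index = chunk_index
        self._file = fileobj

    def __getitem__(self, key):
        entry = self._index[key]          # KeyError -> Zarr treats the chunk as missing
        self._file.seek(entry["offset"])  # jump to the chunk's first byte
        return self._file.read(entry["size"])

    def __contains__(self, key):
        return key in self._index

    def __setitem__(self, key, value):
        # Chunks live at fixed offsets in the source file, so writes
        # cannot be supported.
        raise NotImplementedError("file chunk store is read-only")
```

Since Zarr only needs `__getitem__`/`__contains__` for reading chunk data, an object like this could be paired with one of the existing store types holding the metadata.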

We can discuss specific implementation details if there is enough interest in this new store type.

Thanks!

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 8
  • Comments: 31 (27 by maintainers)

Top GitHub Comments

3 reactions
manzt commented, Sep 1, 2020

I’m curious if people here have tried existing single-file stores like ZipStore, DBMStore, LMDBStore, or SQLiteStore?

I’ve experimented with ZipStore, but reading a ZipStore remotely via HTTP is not very performant and not well supported. Traversing the central directory at the end of the zip file to find the chunk byte offsets takes a long time (and many requests) for large stores, which makes DirectoryStore the most practical choice for archival storage.

As a side note, I wrote a small python package to serve the underlying store for any zarr-python zarr.Array or zarr.Group over HTTP (simple-zarr-server). It works by mapping HTTP requests to the underlying store.__getitem__ and store.__setitem__, making any store accessible to a python client via fsspec.HTTPFileSystem. Not ideal for archival storage, again, but at least a way to access a non-DirectoryStore remotely via fsspec.
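The request-to-store mapping described above can be sketched in a few lines. This is a hypothetical toy version of the idea, not simple-zarr-server's actual code:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def make_handler(store):
    """Build a request handler that serves a MutableMapping-style Zarr
    store over HTTP by routing the URL path to the store's mapping
    methods (a sketch, not the simple-zarr-server implementation)."""

    class StoreHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            key = self.path.lstrip("/")
            try:
                value = store[key]  # routes to store.__getitem__
            except KeyError:
                self.send_response(404)
                self.end_headers()
                return
            self.send_response(200)
            self.send_header("Content-Length", str(len(value)))
            self.end_headers()
            self.wfile.write(value)

        def do_PUT(self):
            length = int(self.headers["Content-Length"])
            # routes to store.__setitem__
            store[self.path.lstrip("/")] = self.rfile.read(length)
            self.send_response(200)
            self.send_header("Content-Length", "0")
            self.end_headers()

        def log_message(self, *args):
            pass  # silence per-request logging

    return StoreHandler
```

A client such as fsspec.HTTPFileSystem can then fetch keys like /group/array/0.0 as plain HTTP GETs.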

However, I think the salient point here is that there is already a significant latent investment in netCDF/HDF data stored in the cloud. Accessing that data through a Zarr store, without the considerable overhead of converting the file format, would unlock significant performance gains and, to a degree, simplify the software stack.

Agreed. Something like a “File Chunk Store” might offer a more standardized way to read other tiled/chunked formats without requiring conversion (despite not being as performant as the built-in stores).

3 reactions
rabernat commented, Aug 31, 2020

Together with pydata/xarray#3804, the idea proposed here could unlock an amazing capability: accessing a big HDF5 file using Zarr very efficiently. I think it’s important to find a way to move forward with it.


