File Chunk Store
Hello!
I want to propose adding a new Zarr store type for the case where all array chunks are located in a single binary file. A prototype implementation, named the file chunk store, is described in this Medium post. In this approach, the Zarr metadata (`.zgroup`, `.zarray`, `.zattrs`, or `.zmetadata`) is stored in one of the current Zarr store types while the array chunks are in a binary file. The file chunk store translates array chunk keys into file seek and read operations, and therefore provides only read access to the chunk data.
The file chunk store requires a mapping between array chunk keys and their file locations. The prototype implementation puts this information for every Zarr array in a JSON file named `.zchunkstore`. An example is below:
```json
{
    "BEAM0001/tx_pulseflag/0": {
        "offset": 94854560,
        "size": 120
    },
    "BEAM0001/tx_pulseflag/1": {
        "offset": 94854680,
        "size": 120
    },
    "BEAM0001/tx_pulseflag/2": {
        "offset": 94854800,
        "size": 120
    },
    "BEAM0001/tx_pulseflag/3": {
        "offset": 94854920,
        "size": 120
    },
    "BEAM0001/tx_pulseflag/4": {
        "offset": 96634038,
        "size": 120
    },
    "BEAM0001/tx_pulseflag/5": {
        "offset": 96634158,
        "size": 123
    },
    "source": {
        "array_name": "/BEAM0001/tx_pulseflag",
        "uri": "https://e4ftl01.cr.usgs.gov/GEDI/GEDI01_B.001/2019.05.26/GEDI01_B_2019146164739_O02560_T04067_02_003_01.h5"
    }
}
```
An array chunk's file location is described by its starting byte in the file (`offset`) and the number of bytes to read (`size`). Also included is information about the source file (`source`) to enable verification of chunk data provenance. The file chunk store prototype uses file-like Python objects, delegating to users the responsibility of arranging access to the correct files.
We can discuss specific implementation details if there is enough interest in this new store type.
Thanks!
Issue Analytics
- State:
- Created 3 years ago
- Reactions: 8
- Comments: 31 (27 by maintainers)
I’ve experimented with `ZipStore`, but reading a `ZipStore` remotely via HTTP is not very performant and not well supported. Traversing the central directory at the end of the zip file to find the chunk byte offsets takes a long time (and many requests) for large stores, making the `DirectoryStore` most ideal for archival storage.
As a side note, I wrote a small Python package to serve the underlying store of any zarr-python `zarr.Array` or `zarr.Group` over HTTP (`simple-zarr-server`). It works by mapping HTTP requests to the underlying `store.__getitem__` and `store.__setitem__`, making any store accessible to a Python client with `fsspec.HTTPFileSystem`. Not ideal for archival storage, again, but at least a way to access a non-`DirectoryStore` remotely via `fsspec`.
Agreed. Something like a “File Chunk Store” might offer a more standardized way to read other tiled/chunked formats without requiring conversion (despite not being as performant as the built-in stores).
Together with pydata/xarray#3804, the idea proposed here could unlock an amazing capability: accessing a big HDF5 file using Zarr very efficiently. I think it’s important to find a way to move forward with it.