File Chunk Store
Hello!
I want to propose adding a new Zarr store type for the case where all array chunks are located in a single binary file. A prototype implementation, named the file chunk store, is described in this Medium post. In this approach, the Zarr metadata (`.zgroup`, `.zarray`, `.zattrs`, or `.zmetadata`) is stored in one of the current Zarr store types while the array chunks are in a binary file. The file chunk store translates array chunk keys into file seek and read operations, and therefore provides only read access to the chunk data.
The file chunk store requires a mapping between array chunk keys and their file locations. The prototype implementation puts this information for every Zarr array in a JSON file named `.zchunkstore`. An example is below:
```json
{
    "BEAM0001/tx_pulseflag/0": {
        "offset": 94854560,
        "size": 120
    },
    "BEAM0001/tx_pulseflag/1": {
        "offset": 94854680,
        "size": 120
    },
    "BEAM0001/tx_pulseflag/2": {
        "offset": 94854800,
        "size": 120
    },
    "BEAM0001/tx_pulseflag/3": {
        "offset": 94854920,
        "size": 120
    },
    "BEAM0001/tx_pulseflag/4": {
        "offset": 96634038,
        "size": 120
    },
    "BEAM0001/tx_pulseflag/5": {
        "offset": 96634158,
        "size": 123
    },
    "source": {
        "array_name": "/BEAM0001/tx_pulseflag",
        "uri": "https://e4ftl01.cr.usgs.gov/GEDI/GEDI01_B.001/2019.05.26/GEDI01_B_2019146164739_O02560_T04067_02_003_01.h5"
    }
}
```
An array chunk's file location is described by its starting byte in the file (`offset`) and the number of bytes to read (`size`). Also included is information about the source file (`source`) to enable verification of chunk data provenance. The file chunk store prototype uses file-like Python objects, delegating to users the responsibility of arranging access to the correct files.
We can discuss specific implementation details if there is enough interest in this new store type.
Thanks!
Issue Analytics
- State:
- Created 3 years ago
- Reactions: 8
- Comments: 31 (27 by maintainers)
I’ve experimented with `ZipStore`, but reading a `ZipStore` remotely via HTTP is not very performant and not well supported. Traversing the central directory at the end of the zip file to find the chunk byte offsets takes a long time (and many requests) for large stores, making the `DirectoryStore` most ideal for archival storage.
As a side note, I wrote a small Python package to serve the underlying store of any zarr-python `zarr.Array` or `zarr.Group` over HTTP (`simple-zarr-server`). It works by mapping HTTP requests to the underlying `store.__getitem__` and `store.__setitem__`, making any store accessible to a Python client with `fsspec.HTTPFileSystem`. Not ideal for archival storage, again, but at least a way to access a non-`DirectoryStore` remotely via `fsspec`.
Agreed. Something like a “File Chunk Store” might offer a more standardized way to read other tiled/chunked formats without requiring conversion (despite not being as performant as the built-in stores).
Together with pydata/xarray#3804, the idea proposed here could unlock an amazing capability: accessing a big HDF5 file using Zarr very efficiently. I think it’s important to find a way to move forward with it.