Enhancement: Chunk-level access API / indexing by chunk rather than voxel
I may be missing something, but zarr currently seems to offer a high-level interface to the chunked data that is largely transparent to the fact that the underlying data is chunked (much like HDF5). I've got a few use cases where I'd like to operate directly on the underlying chunks rather than on the whole large dataset.
My target application is multi-dimensional image processing of large 3D biomedical datasets, where there are many situations in which it would make sense to perform operations on individual chunks in parallel.
Whilst it would be possible to (a) read the chunk size of an array and (b) work out how to slice the dataset in multiples of that chunk size, direct chunk-level access might be easier and more efficient. In essence I'm suggesting something which exposes a simplified version of `Array._chunk_getitem` and `Array._chunk_setitem` which would only ever get or set a complete chunk.
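For reference, a minimal sketch of that manual approach as it stands today, assuming a zarr array opened from a hypothetical placeholder path `'example.zarr'`:

```python
import itertools
import math

import zarr

# Hypothetical array; 'example.zarr' is a placeholder path.
z = zarr.open('example.zarr', mode='r')

# Shape of the chunk grid, derived from the array and chunk shapes.
grid = [math.ceil(s / c) for s, c in zip(z.shape, z.chunks)]

for idx in itertools.product(*(range(g) for g in grid)):
    # Chunk-aligned slice; edge chunks are clipped to the array bounds.
    sel = tuple(slice(i * c, min((i + 1) * c, s))
                for i, c, s in zip(idx, z.chunks, z.shape))
    block = z[sel]  # a chunk-aligned read, so only this chunk is decoded
    # ... process block ...
```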
If you extended this concept somewhat by having the functions return a lightweight `Chunk` object (essentially a reference to the location of the array, a chunk ID, and a `.data` property giving you the data for that chunk), and additionally exposed an iterator on the `Array` class, you could conceivably write code like:
```python
def do_something(chunk):
    res = some_processing_function(chunk.data)
    with zarr.open('output_file_uri') as z1:
        z1.save_chunk(res, chunk.chunk_id)

with multiprocessing.Pool() as pool:
    pool.map(do_something, array.chunk_iterator)
```
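None of `Chunk`, `save_chunk`, or `chunk_iterator` exist in zarr today; the names above are part of the proposal. A minimal sketch of how they might be built on top of the current public slicing API, purely for illustration:

```python
from dataclasses import dataclass
import itertools
import math

import numpy as np


@dataclass
class Chunk:
    """Hypothetical lightweight reference to one chunk of a zarr array."""
    store_uri: str       # location of the owning array
    chunk_id: tuple      # position in the chunk grid, e.g. (0, 1)
    data: np.ndarray     # the decoded chunk contents


def chunk_iterator(z, store_uri):
    """Yield every chunk of zarr array ``z`` as a Chunk object."""
    grid = [math.ceil(s / c) for s, c in zip(z.shape, z.chunks)]
    for idx in itertools.product(*(range(g) for g in grid)):
        sel = tuple(slice(i * c, min((i + 1) * c, s))
                    for i, c, s in zip(idx, z.chunks, z.shape))
        yield Chunk(store_uri, idx, z[sel])


def save_chunk(z, data, chunk_id):
    """Write ``data`` back to the chunk-aligned region ``chunk_id``."""
    sel = tuple(slice(i * c, i * c + n)
                for i, c, n in zip(chunk_id, z.chunks, data.shape))
    z[sel] = data
```

In a real implementation the `.data` property would presumably be lazy, so that worker processes rather than the parent read and decode the bytes; the eager version here is only meant to make the proposed shape of the API concrete.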
In the longer term (and I'm not sure how to go about this; I might be better aiming for API compatibility with zarr rather than inclusion in zarr) I'd want to enable data-local processing of individual chunks of an array chunked across a distributed file system. I currently have a Python solution for this for 2.5-dimensional data (xyt), but it's pretty specific to one use case, and I'd like to avoid duplicating other efforts as we make it more general.
+1 for this feature. I am a dask user, and I second the sentiment of using dask for complex array computations over chunks.
However, I have use cases where I would like to just load zarr chunks and where performance is important. I would prefer to remove the dask layer entirely, since all I would be doing is `a.blocks[i, j].compute()` on random blocks. For now, I'm manually computing chunk boundaries to accomplish this. It's perfectly fine, but it would be nice to have an `a.blocks`-style API like dask's.

Thanks @amatsukawa for commenting. FWIW I'd still be happy for a function which returns a specific chunk as a numpy array, given chunk indices. Happy for that to go ahead if it's useful; it would be a relatively small change.
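Something along those lines is already expressible on top of the current slicing interface. A possible sketch, where the name `get_chunk` is made up here and is not part of zarr's API:

```python
def get_chunk(z, chunk_index):
    """Return the chunk at ``chunk_index`` (chunk-grid coordinates) of
    zarr array ``z`` as a numpy array.

    Hypothetical helper: it computes the chunk's voxel bounds from
    ``z.chunks`` and reads them through the normal slicing interface,
    so only that single chunk is fetched and decoded.
    """
    sel = tuple(slice(i * c, min((i + 1) * c, s))
                for i, c, s in zip(chunk_index, z.chunks, z.shape))
    return z[sel]


# Roughly the zarr-only equivalent of dask's a.blocks[i, j].compute():
# block = get_chunk(z, (i, j))
```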