Enhancement: Chunk-level access API / indexing by chunk rather than voxel
I may be missing something, but zarr currently seems to offer a high-level interface to the chunked data that is largely transparent to the fact that the underlying data is chunked (much like HDF5). I've got a few use cases where I'd like to operate directly on the underlying chunks rather than on the whole large dataset.
My target application is multi-dimensional image processing of large 3D biomedical datasets, where there are many situations in which it would make sense to perform operations on individual chunks in parallel.
Whilst it would be possible to (a) read the chunk size of an array and (b) work out how to slice the dataset in multiples of that chunk size, direct chunk-level access might be easier and more efficient. In essence I'm suggesting something which exposes a simplified version of `Array._chunk_getitem` and `Array._chunk_setitem` which would only ever get or set a complete chunk.
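For reference, a minimal sketch of that manual approach as it stands today, assuming a zarr array opened from a hypothetical placeholder path `'example.zarr'`:

```python
import itertools
import math

import zarr

# Hypothetical array; 'example.zarr' is a placeholder path.
z = zarr.open('example.zarr', mode='r')

# Shape of the chunk grid, derived from the array and chunk shapes.
grid = [math.ceil(s / c) for s, c in zip(z.shape, z.chunks)]

for idx in itertools.product(*(range(g) for g in grid)):
    # Chunk-aligned slice; edge chunks are clipped to the array bounds.
    sel = tuple(slice(i * c, min((i + 1) * c, s))
                for i, c, s in zip(idx, z.chunks, z.shape))
    block = z[sel]  # a chunk-aligned read, so only this chunk is decoded
    # ... process block ...
```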
If you extended this concept somewhat by having the functions return a lightweight `Chunk` object (essentially a reference to the location of the array, a chunk ID, and a `.data` property giving you the data for that chunk), and additionally exposed an iterator on the `Array` class, you could conceivably write code like:
```python
def do_something(chunk):
    res = some_processing_function(chunk.data)
    with zarr.open('output_file_uri') as z1:
        z1.save_chunk(res, chunk.chunk_id)

with multiprocessing.Pool() as pool:
    pool.map(do_something, array.chunk_iterator)
```
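None of `Chunk`, `save_chunk`, or `chunk_iterator` exist in zarr today; the names above are part of the proposal. A minimal sketch of how they might be built on top of the current public slicing API, purely for illustration:

```python
from dataclasses import dataclass
import itertools
import math

import numpy as np


@dataclass
class Chunk:
    """Hypothetical lightweight reference to one chunk of a zarr array."""
    store_uri: str       # location of the owning array
    chunk_id: tuple      # position in the chunk grid, e.g. (0, 1)
    data: np.ndarray     # the decoded chunk contents


def chunk_iterator(z, store_uri):
    """Yield every chunk of zarr array ``z`` as a Chunk object."""
    grid = [math.ceil(s / c) for s, c in zip(z.shape, z.chunks)]
    for idx in itertools.product(*(range(g) for g in grid)):
        sel = tuple(slice(i * c, min((i + 1) * c, s))
                    for i, c, s in zip(idx, z.chunks, z.shape))
        yield Chunk(store_uri, idx, z[sel])


def save_chunk(z, data, chunk_id):
    """Write ``data`` back to the chunk-aligned region ``chunk_id``."""
    sel = tuple(slice(i * c, i * c + n)
                for i, c, n in zip(chunk_id, z.chunks, data.shape))
    z[sel] = data
```

In a real implementation the `.data` property would presumably be lazy, so that worker processes rather than the parent read and decode the bytes; the eager version here is only meant to make the proposed shape of the API concrete.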
In the longer term (and I'm not sure how to go about this; I might be better aiming for API compatibility with zarr rather than inclusion in zarr) I'd want to enable data-local processing of individual chunks of an array chunked across a distributed file system. I currently have a Python solution for this for 2.5-dimensional data (xyt), but it's pretty specific to one use case, and I'd like to avoid duplicating other efforts as we make it more general.
+1 for this feature. I am a dask user, and I second the sentiment of using dask for complex array computations over chunks.
However, I have use cases where I would like to just load zarr chunks and where performance is important. I would prefer to remove the dask layer entirely, since all I would be doing is `a.blocks[i, j].compute()` on random blocks. For now, I'm manually computing chunk boundaries to accomplish this. It's perfectly fine, but it would be nice to have an `a.blocks`-style API like dask's.

Thanks @amatsukawa for commenting. FWIW I'd still be happy for a function which returns a specific chunk as a numpy array, given chunk indices. Happy for that to go ahead if it's useful; it would be a relatively small change.
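Something along those lines is already expressible on top of the current slicing interface. A possible sketch, where the name `get_chunk` is made up here and is not part of zarr's API:

```python
def get_chunk(z, chunk_index):
    """Return the chunk at ``chunk_index`` (chunk-grid coordinates) of
    zarr array ``z`` as a numpy array.

    Hypothetical helper: it computes the chunk's voxel bounds from
    ``z.chunks`` and reads them through the normal slicing interface,
    so only that single chunk is fetched and decoded.
    """
    sel = tuple(slice(i * c, min((i + 1) * c, s))
                for i, c, s in zip(chunk_index, z.chunks, z.shape))
    return z[sel]


# Roughly the zarr-only equivalent of dask's a.blocks[i, j].compute():
# block = get_chunk(z, (i, j))
```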