
Enhancement: Chunk level access api / indexing by chunk rather than voxel

See original GitHub issue

I may be missing something, but zarr currently seems to offer only a high-level interface to the chunked data, one that is largely transparent to the fact that the underlying data is in fact chunked (much like HDF5). I’ve got a few use cases where I’d like to operate directly on the underlying chunks rather than the whole large dataset.

My target application is multi-dimensional image processing of large 3D biomedical datasets, where there are many situations in which it would make sense to perform operations on individual chunks in parallel.

Whilst it would be possible to a) read the chunk size of an array and b) work out how to slice the dataset in multiples of that chunk size, direct chunk-level access might be easier and more efficient. In essence I’m suggesting something which exposes a simplified version of Array._chunk_getitem and Array._chunk_setitem which would only ever get or set a complete chunk.
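For illustration, the slicing-in-multiples-of-chunk-size workaround can be sketched with a small helper (the name iter_chunk_slices is hypothetical, not zarr API); it enumerates chunk-aligned slice tuples from an array’s .shape and .chunks:

```python
import itertools

def iter_chunk_slices(shape, chunks):
    """Yield one tuple of slices per chunk, trimming edge
    chunks to the array bounds."""
    ranges = [range(0, s, c) for s, c in zip(shape, chunks)]
    for offsets in itertools.product(*ranges):
        yield tuple(slice(o, min(o + c, s))
                    for o, c, s in zip(offsets, chunks, shape))

# e.g. a (100, 100) array stored as (30, 30) chunks -> a 4 x 4 chunk grid
slices = list(iter_chunk_slices((100, 100), (30, 30)))
```

Each slice tuple could then be applied to a zarr array (z[sl]) to read exactly one chunk’s worth of data, albeit via the high-level indexing path rather than direct chunk access.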

If you extended this concept somewhat by having the functions return a lightweight Chunk object (essentially a reference to the array’s location, a chunk ID, and a .data property giving you the data for that chunk), and additionally exposed an iterator on the Array class, you could conceivably write code like:

import multiprocessing
import zarr

def do_something(chunk):
    res = some_processing_function(chunk.data)   # user-supplied processing
    with zarr.open('output_file_uri') as z1:
        z1.save_chunk(res, chunk.chunk_id)       # proposed API

with multiprocessing.Pool() as pool:
    pool.map(do_something, array.chunk_iterator)  # proposed API

In the longer term (and I’m not sure how to go about this - I might be better aiming for API compatibility with zarr rather than inclusion in zarr) I’d want to enable data-local processing of individual chunks of an array which was chunked across a distributed file system. I’ve currently got a python solution for this for 2.5 dimensional data (xyt) but it’s pretty specific to one use case and I would like to avoid duplicating other efforts as we make it more general.

Issue Analytics

  • State: open
  • Created 4 years ago
  • Reactions: 4
  • Comments: 21 (16 by maintainers)

Top GitHub Comments

8 reactions
amatsukawa commented, Mar 13, 2020

+1 for this feature. I am a dask user, and second the sentiment of using dask for complex array computations over chunks.

However, I have use cases where performance is important and I would like to load zarr chunks directly. I would prefer to remove the dask layer entirely, since I would just be doing a.blocks[i, j].compute() on random blocks. For now, I’m manually computing chunk boundaries to accomplish this. It’s perfectly fine, but it would be nice to have a .blocks API like dask’s.
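For reference, the dask pattern being described looks like this (a sketch using a toy in-memory array; .blocks indexing is dask’s public API):

```python
import dask.array as da

# a (100, 100) array split into a 4 x 4 grid of (30, 30) chunks
a = da.zeros((100, 100), chunks=(30, 30))

# fetch the single chunk at grid position (1, 2): rows 30:60, cols 60:90
block = a.blocks[1, 2].compute()  # plain numpy array
```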

3 reactions
alimanfoo commented, Apr 6, 2020

Thanks @amatsukawa for commenting. FWIW I’d still be happy with a function which returns a specific chunk as a numpy array, given chunk indices. Happy for that to go ahead if it’s useful; it would be a relatively small change.
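Such a helper could be sketched today on top of public slicing; the name get_chunk is hypothetical, and it works for any array-like exposing .shape and .chunks (such as a zarr array):

```python
def get_chunk(arr, chunk_index):
    """Return the chunk at integer grid position `chunk_index`,
    as whatever `arr`'s indexing yields (a numpy array for zarr).
    Edge chunks are trimmed to the array bounds."""
    sel = tuple(slice(i * c, min((i + 1) * c, s))
                for i, c, s in zip(chunk_index, arr.chunks, arr.shape))
    return arr[sel]

# usage against an already-opened zarr array z:
#   chunk = get_chunk(z, (3, 1))
```

A built-in version could skip the slicing machinery and decode the chunk directly, which is where the efficiency gain discussed above would come from.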
