Default efficient row iteration

It seems like iterating over a chunked array is inefficient at the moment, presumably because we’re repeatedly decompressing the chunks. For example, if I do

    for pos, row in enumerate(data_root.variants):
        print(row)
        if pos == 1000:
            break

it takes several minutes (data_root.variants is a large 2D chunked matrix) but if I do

    for pos, row in enumerate(chunk_iterator(data_root.variants)):
        print(row)
        if pos == 1000:
            break

it takes less than a second, where

    def chunk_iterator(array):
        """
        Utility to iterate over the rows in the specified array efficiently
        by accessing one chunk at a time.
        """
        chunk_size = array.chunks[0]
        for j in range(array.shape[0]):
            # Load (and decompress) a new chunk only at a chunk boundary.
            if j % chunk_size == 0:
                chunk = array[j: j + chunk_size][:]
            yield chunk[j % chunk_size]

To me, it’s quite a surprising gotcha that zarr isn’t doing this chunkwise decompression, and I think it would be good to do it by default. There is a small extra memory overhead, but I think that’s probably OK, given the performance benefits.

Any thoughts?
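
To make the cost difference concrete, here is a minimal sketch that counts decompressions for both access patterns. It does not use zarr: `ToyChunkedArray`, `naive_rows`, and the `decompress_calls` counter are hypothetical stand-ins invented for illustration, and the toy assumes that every integer index decompresses the whole chunk containing that row. Only `chunk_iterator` follows the helper from the issue.

```python
import pickle
import zlib


class ToyChunkedArray:
    """Hypothetical stand-in for a chunked, compressed 2D array (not zarr's API)."""

    def __init__(self, rows, chunk_size):
        self.shape = (len(rows), len(rows[0]))
        self.chunks = (chunk_size, len(rows[0]))
        # Each stored item holds up to chunk_size consecutive rows, compressed together.
        self._store = [
            zlib.compress(pickle.dumps(rows[i:i + chunk_size]))
            for i in range(0, len(rows), chunk_size)
        ]
        self.decompress_calls = 0

    def _load_chunk(self, index):
        self.decompress_calls += 1
        return pickle.loads(zlib.decompress(self._store[index]))

    def __getitem__(self, key):
        size = self.chunks[0]
        if isinstance(key, slice):
            # Contiguous slices only (step is ignored in this toy).
            start, stop, _ = key.indices(self.shape[0])
            rows = []
            # Decompress each chunk touched by the slice exactly once.
            for index in range(start // size, (stop + size - 1) // size):
                chunk = self._load_chunk(index)
                lo = max(start, index * size) - index * size
                hi = min(stop, (index + 1) * size) - index * size
                rows.extend(chunk[lo:hi])
            return rows
        if not 0 <= key < self.shape[0]:
            raise IndexError(key)
        # A single-row lookup still decompresses the whole chunk.
        return self._load_chunk(key // size)[key % size]


def naive_rows(array):
    """Row-by-row indexing: one chunk decompression per row."""
    for j in range(array.shape[0]):
        yield array[j]


def chunk_iterator(array):
    """Chunkwise access, as in the issue: one decompression per chunk."""
    chunk_size = array.chunks[0]
    for j in range(array.shape[0]):
        if j % chunk_size == 0:
            chunk = array[j:j + chunk_size]
        yield chunk[j % chunk_size]


rows = [[i, i * 2] for i in range(10)]
naive = ToyChunkedArray(rows, chunk_size=4)
chunked = ToyChunkedArray(rows, chunk_size=4)
assert list(naive_rows(naive)) == rows
assert list(chunk_iterator(chunked)) == rows
print(naive.decompress_calls, chunked.decompress_calls)  # 10 3
```

With 10 rows in chunks of 4, row-by-row indexing decompresses 10 times while chunkwise access decompresses 3 times. In this toy model the gap grows with the number of rows per chunk.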

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
alimanfoo commented, Jan 29, 2019

Hi @jeromekelleher, I like your suggestion. As @jakirkham says, the cache for decoded chunks would partly solve this, but as you say iteration is actually a simpler requirement where you only cache one chunk at a time while you iterate through it. PR welcome.
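
As a rough sketch of what "cache one chunk at a time" could look like as default behaviour, the mixin below defines `__iter__` in terms of chunk-sized slices. `ChunkwiseIterMixin`, `ListBackedArray`, and `slice_requests` are hypothetical names made up for this illustration, not zarr code. The only assumption is that the host class exposes `shape`, `chunks`, and slice support through `__getitem__`.

```python
class ChunkwiseIterMixin:
    """Hypothetical mixin: expects `shape`, `chunks`, and sliceable __getitem__."""

    def __iter__(self):
        chunk_size = self.chunks[0]
        for start in range(0, self.shape[0], chunk_size):
            # Hold one chunk's worth of rows at a time while iterating.
            chunk = self[start:start + chunk_size]
            yield from chunk


class ListBackedArray(ChunkwiseIterMixin):
    """Toy in-memory stand-in that records which slices were requested."""

    def __init__(self, rows, chunk_size):
        self._rows = rows
        self.shape = (len(rows),)
        self.chunks = (chunk_size,)
        self.slice_requests = []

    def __getitem__(self, key):
        if isinstance(key, slice):
            self.slice_requests.append((key.start, key.stop))
        return self._rows[key]


array = ListBackedArray(list(range(10)), chunk_size=4)
assert list(array) == list(range(10))
print(array.slice_requests)  # [(0, 4), (4, 8), (8, 12)]
```

A plain `for row in array` loop then issues one slice request per chunk rather than one lookup per row, at the cost of keeping a single decoded chunk in memory.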

0 reactions
alimanfoo commented, Feb 5, 2019

I’d suggest adding a test_iter() method on the TestArray class in the test_core module. E.g., something like:

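    # Assumes module-level imports (Python 3 names): numpy as np,
    # itertools.zip_longest, and numpy.testing.assert_array_equal.
    def test_iter(self):
        params = (
            ((1000,), (100,)),
            ((100, 100), (10, 10)),
            # any other combination of shape and chunks you'd like to test
        )
        for shape, chunks in params:
            z = self.create_array(shape=shape, chunks=chunks, dtype=int)
            a = np.arange(np.prod(shape)).reshape(shape)
            z[:] = a
            # zip_longest also catches a length mismatch between a and z
            for expect, actual in zip_longest(a, z):
                assert_array_equal(expect, actual)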

Default efficient row iteration · Issue #398 · zarr-developers ...