Default efficient row iteration
It seems like iterating over a chunked array is inefficient at the moment, presumably because we’re repeatedly decompressing the chunks. For example, if I do
    for pos, row in enumerate(data_root.variants):
        print(row)
        if pos == 1000:
            break
it takes several minutes (data_root.variants is a large 2D chunked matrix), but if I do
    for pos, row in enumerate(chunk_iterator(data_root.variants)):
        print(row)
        if pos == 1000:
            break
it takes less than a second, where
    def chunk_iterator(array):
        """
        Utility to iterate over the rows in the specified array efficiently
        by accessing one chunk at a time.
        """
        chunk_size = array.chunks[0]
        for j in range(array.shape[0]):
            if j % chunk_size == 0:
                chunk = array[j: j + chunk_size][:]
            yield chunk[j % chunk_size]
To me, it’s quite a surprising gotcha that zarr isn’t doing this chunkwise decompression, and I think it would be good to do it by default. There is a small extra memory overhead, but I think that’s probably OK, given the performance benefits.
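For concreteness, here is a rough sketch of what chunk-wise iteration as the default could look like, written as a standalone wrapper rather than as a change to zarr itself; the RowIterArray class and the synthetic array below are illustrative assumptions, not zarr API.

    import numpy as np
    import zarr


    class RowIterArray:
        """Hypothetical wrapper showing what default chunk-wise row
        iteration could look like; not part of zarr's actual API."""

        def __init__(self, array):
            self.array = array

        def __iter__(self):
            chunk_size = self.array.chunks[0]
            chunk = None
            for j in range(self.array.shape[0]):
                if j % chunk_size == 0:
                    # Decompress the next chunk of rows once, then serve
                    # individual rows out of the in-memory buffer.
                    chunk = self.array[j:j + chunk_size]
                yield chunk[j % chunk_size]


    # Usage with an arbitrary synthetic array.
    z = zarr.array(np.random.random((10_000, 20)), chunks=(1_000, 20))
    for pos, row in enumerate(RowIterArray(z)):
        if pos == 2_000:
            break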
Any thoughts?
Issue Analytics
- Created 5 years ago
- Comments: 5 (5 by maintainers)
Hi @jeromekelleher, I like your suggestion. As @jakirkham says, the cache for decoded chunks would partly solve this, but, as you say, iteration is a simpler requirement: you only need to cache one chunk at a time while you iterate through it. PR welcome.
I’d suggest adding a test_iter() method on the TestArray class in the test_core module. E.g., something like:
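As a rough sketch (not the original comment's code, which isn't reproduced here), such a test might look like the following, assuming the TestArray fixture exposes a create_array() helper as other tests in test_core appear to; that helper name and the array dimensions are assumptions.

    import numpy as np
    from numpy.testing import assert_array_equal


    def test_iter(self):
        # Sketch only: relies on the assumed create_array() helper and on
        # row iteration over the zarr array matching the underlying data.
        data = np.arange(1000).reshape(100, 10)
        z = self.create_array(shape=data.shape, chunks=(10, 10), dtype=data.dtype)
        z[:] = data
        for expect, actual in zip(data, z):
            assert_array_equal(expect, actual)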