
Zarr use case: problems in multi-threaded access to in-memory cache

See original GitHub issue

I have a read-only zarr array stored on Azure Blob. It is a 4-dimensional climate data set, chunked along all 4 dimensions. It has about 9000 time steps in total, with 12 time steps per chunk.

I have a set of jobs, each of which I know will need access to a maximum of 4 time steps, so even if it crosses a chunk boundary, it will need access to at most 2 chunks (as far as the time dimension is concerned). There are hundreds of these jobs, many of which need access to the same chunk, so I use zarr.storage.LRUStoreCache to cache the chunks. The problem is that I’m not sure how I can do these jobs in parallel.

What I currently do is sort the jobs in time order and run one job at a time, in a for loop. The only parallelization here is fetching the chunks from blob for, say, the first job; the rest of the jobs then hit the chunks in the cache and don’t have to get them from blob. But because these jobs run sequentially, 500 jobs take about 500 seconds to complete (plus some time to get chunks from blob in the first job, which is much less than 500 seconds).

I considered using threads, but it was really slow, presumably because of the GIL? I’m using a 64-core machine on Azure, where I can see the CPU spike during the first job, when it’s getting chunks from blob, but not for the rest of the jobs, where CPU usage is minimal. Behavior is the same when using threads. Is there a way I can access chunks from the cache in parallel?
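For reference, the setup described above might look roughly like this (a minimal sketch against the zarr v2 API; the container, account, `job_start_steps`, and `process` names are hypothetical placeholders, not code from the thread):

```python
import zarr
from zarr.storage import ABSStore, LRUStoreCache

# Read-only 4-D array on Azure Blob, chunked along all four dimensions.
store = ABSStore(container="climate", prefix="data.zarr",
                 account_name="myaccount", account_key="...")

# Cache chunks in memory so repeated jobs avoid blob round trips.
cache = LRUStoreCache(store, max_size=2**30)  # keep up to ~1 GiB of chunks
z = zarr.open_array(store=cache, mode="r")

# Jobs sorted in time order; each touches at most 4 consecutive time
# steps, i.e. at most 2 chunks along the time axis (12 steps per chunk).
for t0 in sorted(job_start_steps):   # job_start_steps: hypothetical
    window = z[t0:t0 + 4]            # first touch fetches from blob;
    process(window)                  # later jobs hit the LRU cache
```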

PR #2814 from xarray might be relevant here. Any thoughts @rabernat?

Does Dask have any method of extracting array chunks from cache in parallel? @mrocklin @jhamman

@jakirkham @alimanfoo

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

1 reaction
alimanfoo commented, Aug 13, 2019

Hi @shikharsg, FWIW I’d suggest investigating Dask for parallelism, although that is not a complete answer, as there are lots of details around how you might get Dask and caching to play nicely together.
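One way to act on this suggestion (a sketch, not something spelled out in the thread): wrap the cached zarr array in a Dask array and submit all jobs as delayed tasks, letting the scheduler run them in parallel. `z`, `job_start_steps`, and `process` are the hypothetical names from the earlier sketch.

```python
import dask
import dask.array as da

darr = da.from_zarr(z)  # Dask array; chunking follows the zarr chunks

# One delayed task per job; computing them together lets the scheduler
# parallelize and reuse chunks across jobs.
tasks = [dask.delayed(process)(darr[t0:t0 + 4]) for t0 in job_start_steps]
results = dask.compute(*tasks, scheduler="threads")
```

Note that with the threaded scheduler the workers still share one Python process; how much this helps depends on how much of the work (e.g. Blosc decompression) releases the GIL. A process-based or distributed scheduler sidesteps the GIL entirely, but then the workers cannot share a single in-process LRUStoreCache.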


0 reactions
shikharsg commented, Aug 31, 2019

Quoting @jakirkham:

“I think at some point we are going to want to detach the concept of storage backend from where it sits in the loading pipeline. For example, when retrieving data from a cloud store, I might want to have an intermediate storage layer like a local database that provides quicker access to some data. Additionally I may want something after that, which holds data in-memory. Though maybe that writes some data to disk. We will want that functionality regardless of whether it uses a Plasma Object store, LMDB, or something else.”
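For concreteness, the layering idea quoted above could be sketched as a stack of MutableMapping stores, where reads fall through to slower tiers and promote values into the faster ones (illustrative only; `TieredStore` is not an existing zarr API):

```python
from collections.abc import MutableMapping

class TieredStore(MutableMapping):
    """Chain of stores, fastest first, e.g. (dict, LMDB store, cloud store)."""

    def __init__(self, *tiers):
        self.tiers = tiers

    def __getitem__(self, key):
        for i, tier in enumerate(self.tiers):
            try:
                value = tier[key]
            except KeyError:
                continue
            for faster in self.tiers[:i]:  # promote into the faster tiers
                faster[key] = value
            return value
        raise KeyError(key)

    def __setitem__(self, key, value):
        self.tiers[-1][key] = value        # write through to the backing store

    def __delitem__(self, key):
        for tier in self.tiers:
            tier.pop(key, None)

    def __iter__(self):                    # enumerate the authoritative tier
        return iter(self.tiers[-1])

    def __len__(self):
        return len(self.tiers[-1])
```

A zarr array opened on `TieredStore({}, local_store, abs_store)` (the latter two names hypothetical) would then read through memory, then local disk, then blob.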

@jakirkham, I managed to implement a cache for decoded results using memcached. Now I can easily have multiple processes access chunks from (and store chunks to) the “same cache”, which is stored outside of these processes in memcached. The performance vastly surpasses the sequential method I have described above. I can also see full CPU utilization on my D64 Azure VM.
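The thread doesn’t include the actual implementation, but one possible shape for it, using pymemcache (the key scheme and helper name are assumptions), is to cache the decoded bytes of each window so every worker process shares them:

```python
import numpy as np
from pymemcache.client.base import Client

mc = Client(("localhost", 11211))  # memcached shared by all worker processes

def read_window(z, t0, length=4):
    """Read z[t0:t0+length], caching the decoded bytes in memcached."""
    key = f"climate:{t0}:{length}"
    hit = mc.get(key)                # returns bytes, or None on a miss
    if hit is not None:
        return np.frombuffer(hit, dtype=z.dtype).reshape(
            (length,) + z.shape[1:])
    window = z[t0:t0 + length]       # zarr fetches from blob and decompresses
    mc.set(key, window.tobytes())    # note: memcached's default item size
    return window                    # limit is 1 MiB; large windows need -I
```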

Quoting @mrocklin:

“Perhaps instead of using a cache you can just give all of the computations to Dask at once and have it keep track of reusing the data many times itself.”

(That was in reply to @dazzag24, who had asked on Aug 14, 2019: “So some kind of automatic local caching of data if you are using a cloud storage backend? Makes sense. Would this be a cache of the compressed or uncompressed chunks? In @shikharsg’s case I believe he is caching the uncompressed chunks in memory.”)

@mrocklin that’s what I thought at first. But the application I am building works something like this: I get requests from the user to process some jobs. Say I give the first set of jobs to Dask together (so all of those computations are submitted at once), but before that set of jobs finishes I get a few more requests from the user, which I must start before the first set finishes (because I want the rate of processing to be fast) and which might also use the same zarr chunks as the first set of jobs.
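For this kind of rolling workload, one option (an assumption on my part, not something settled in the thread) is dask.distributed futures: submit each job as its request arrives instead of batching, and the scheduler interleaves new work with what is already running. `incoming_requests` and `run_job` are hypothetical.

```python
from dask.distributed import Client

client = Client()  # local cluster, e.g. one worker process per core

futures = []
for t0 in incoming_requests():      # hypothetical stream of user requests
    # submit() returns immediately; the scheduler interleaves this job
    # with whatever is already running
    futures.append(client.submit(run_job, t0))

results = client.gather(futures)    # collect results as jobs finish
```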


Top Results From Across the Web

Tutorial — zarr 2.13.3 documentation - Read the Docs
“When a Zarr array is being used within a multi-threaded program, Zarr automatically switches to using Blosc in a single-threaded ‘contextual’ mode. This...”

Using dask and zarr for multithreaded input/output
“It integrates with zarr to allow calculations on datasets that don’t fit into core memory, either in a single node or across a...”

Moving away from HDF5 - Cyrille Rossant
“Modern multithreaded codecs like Blosc now allow access to compressed data faster than uncompressed in many cases. Parallel access: This is...”

Best practices to go from 1000s of netcdf files to analyses on a...
“We have also made 19 zarr arrays containing about 20-day worth of data, which seem to load in a notebook, and manage to...”

Building open source downscaling pipelines for the cloud
“After experimenting with xpersist, we resolved our caching issues using a simple, custom approach. First, we hashed the relevant input arguments...”
