Zarr use case: problems with multi-threaded access to an in-memory cache
I have a read-only zarr array stored on Azure Blob. It is a 4-dimensional climate data set and is chunked along all 4 dimensions. It has about 9000 time steps in total, with 12 time steps per chunk.
I have a set of jobs, each of which I know will need access to at most 4 time steps, so even if it crosses a chunk boundary it will need at most 2 chunks (as far as the time dimension is concerned). There are hundreds of these jobs, many of which need access to the same chunks, so I use zarr.storage.LRUStoreCache to cache the chunks.
The problem is that I'm not sure how to run these jobs in parallel. What I currently do is sort the jobs in time order and run one job at a time, in a for loop. The only parallelism here is fetching the chunks from blob storage for, say, the first job; the rest of the jobs then hit the chunks in the cache and don't have to fetch them from blob. But because the jobs run sequentially, 500 jobs take about 500 seconds to complete (plus some time to fetch chunks from blob in the first job, which is much less than 500 seconds). I considered using threads, but it was really slow, presumably because of the GIL? I'm using a 64-core machine on Azure, where I can see the CPU spike during the first job, when it's fetching chunks from blob, but not for the rest of the jobs, where CPU usage is minimal. Behavior is the same when using threads. Is there a way I can access chunks from the cache in parallel?
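
For reference, a minimal sketch of the setup described above, using zarr's ABSStore wrapped in an LRUStoreCache (the container, prefix, and credentials are placeholders, not details from the issue):

```python
import zarr
from zarr.storage import ABSStore, LRUStoreCache

# Placeholder container/prefix/credentials; the real store details aren't given above.
store = ABSStore(container="climate-data", prefix="dataset.zarr",
                 account_name="myaccount", account_key="...")

# Wrap the remote store in an in-memory LRU cache (capped here at ~2 GiB) so that
# repeated reads of the same chunk hit memory instead of Azure Blob.
cache = LRUStoreCache(store, max_size=2 * 2**30)
z = zarr.open_array(store=cache, mode="r")

# Each job reads at most 4 consecutive time steps, i.e. at most 2 chunks along
# the time axis (12 time steps per chunk).
def run_job(t_start):
    block = z[t_start:t_start + 4]  # decoding happens on every read, even on a cache hit
    ...  # process the block
```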
PR #2814 from xarray might be relevant here. Any thoughts @rabernat?
Does dask have any method of extracting array chunks from cache in parallel? @mrocklin @jhamman
Hi @shikharsg, FWIW I'd suggest investigating Dask for parallelism, although that is not a complete answer, as there are lots of details around how you might get Dask and caching to play nicely together.
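
A rough sketch of what this could look like in practice, handing the cached zarr array to Dask so chunk reads happen in parallel (the cache size, time offsets, and scheduler choice are illustrative assumptions; the caveat about Dask and caching still applies, since a process-local LRUStoreCache is only shared by the threaded scheduler):

```python
import dask
import dask.array as da
import zarr
from zarr.storage import ABSStore, LRUStoreCache

# Same placeholder store as in the question's sketch above.
store = ABSStore(container="climate-data", prefix="dataset.zarr",
                 account_name="myaccount", account_key="...")
cache = LRUStoreCache(store, max_size=2 * 2**30)
z = zarr.open_array(store=cache, mode="r")

# Expose the zarr array as a chunked dask array; each job becomes a lazy slice.
darr = da.from_zarr(z)
jobs = [darr[t:t + 4] for t in (0, 3, 7, 11)]       # illustrative time offsets

# The threaded scheduler shares the process-local LRUStoreCache; a distributed
# cluster would not, which is part of the "play nicely together" caveat.
results = dask.compute(*jobs, scheduler="threads")
```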
@jakirkham, I managed to implement a cache for decoded results using memcached. Now I can easily have multiple processes access chunks from (and store chunks to) the same cache, which lives outside these processes in memcached. The performance vastly surpasses the sequential method I described above, and I can see full CPU utilization on my D64 Azure VM.
@mrocklin that's what I thought at first. But the application I am building works something like this: I receive requests from users to process some jobs. Say I hand the first set of jobs to Dask together (so all of those computations are submitted at once); before that set finishes I may receive a few more requests, which I must start before the first set completes (because I want the rate of processing to stay fast) and which might also use the same zarr chunks as the first set of jobs.
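
The actual implementation isn't shown in the thread, and it caches decoded results rather than raw chunks, but a simplified sketch of the general idea, a chunk cache shared across processes via memcached, might look like the following (the class name and the use of the pymemcache client are assumptions; this version caches the encoded chunks):

```python
from collections.abc import MutableMapping
from pymemcache.client.base import Client  # assumed memcached client library


class MemcachedStoreCache(MutableMapping):
    """Hypothetical read-through cache: chunk keys are looked up in a shared
    memcached instance first, and only fetched from the underlying store
    (e.g. Azure Blob) on a miss. Multiple worker processes can share it.
    Note memcached's default 1 MB item limit may need raising for large chunks."""

    def __init__(self, store, host="localhost", port=11211):
        self._store = store
        self._client = Client((host, port))

    def __getitem__(self, key):
        value = self._client.get(key)
        if value is None:
            value = self._store[key]      # miss: fetch from the underlying store
            self._client.set(key, value)  # populate the shared cache
        return value

    def __setitem__(self, key, value):
        self._store[key] = value

    def __delitem__(self, key):
        del self._store[key]

    def __iter__(self):
        return iter(self._store)

    def __len__(self):
        return len(self._store)
```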