Locks and chunked storage
dask.array.store takes an optional argument, lock, which (to my understanding) avoids write contention by forcing workers to request access to the write target before writing. But for chunked storage like zarr arrays, write contention happens at the level of individual chunks, not the entire array. So perhaps a lock for chunked writes should have the granularity of the chunk structure of the storage target, thereby allowing the scheduler to control access to individual chunks. Does this make sense?
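For concreteness, here is a minimal sketch of how the existing lock argument is used today with a zarr target; the shapes, chunk sizes, and store paths are made up for illustration. When the target's chunks are aligned with the dask chunks no lock is needed at all, and when they are not, the coarse option is a single lock around every write:

```python
import dask.array as da
import zarr
from threading import Lock  # with a distributed cluster you would use distributed.Lock instead

# Made-up array, chunk sizes, and paths, purely for illustration.
arr = da.random.random((8192, 8192), chunks=(1024, 1024))

# Target chunks aligned with the dask chunks: every task writes a
# disjoint set of zarr chunks, so locking can be skipped entirely.
z_aligned = zarr.open(
    "aligned.zarr", mode="w", shape=arr.shape, chunks=(1024, 1024), dtype=arr.dtype
)
da.store(arr, z_aligned, lock=False)

# Misaligned target chunks: the coarse-grained option is a single lock
# around every write, which is safe but serializes all writing.
z_misaligned = zarr.open(
    "misaligned.zarr", mode="w", shape=arr.shape, chunks=(1500, 1500), dtype=arr.dtype
)
da.store(arr, z_misaligned, lock=Lock())
```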
The context for this question is my attempt to optimize storing multiple enormous dask arrays in zarr containers with a minimum amount of rechunking, which makes the locking mechanism attractive, as long as the lock happens at the right place.
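To illustrate the per-chunk idea, here is a rough sketch (not an existing dask API) of keying a named distributed Lock on each target chunk that a write overlaps, so that only writes touching the same zarr chunk serialize against each other. It assumes a running distributed Client, reuses arr and the misaligned target from the sketch above, and the helpers write_region and overlapped_chunks are hypothetical:

```python
import itertools

import dask
import numpy as np
from distributed import Lock  # requires a running distributed Client

def overlapped_chunks(region, zarr_chunks):
    # Hypothetical helper: indices of every zarr chunk that `region` touches.
    ranges = [
        range(sl.start // c, (sl.stop - 1) // c + 1)
        for sl, c in zip(region, zarr_chunks)
    ]
    # itertools.product yields indices in a consistent order, which also
    # gives a global lock-acquisition order and so avoids deadlocks.
    return list(itertools.product(*ranges))

@dask.delayed
def write_region(block, target, region):
    # Acquire one named lock per target chunk this write touches, so only
    # writes that actually overlap the same zarr chunk contend.
    locks = [Lock(f"zarr-chunk-{idx}") for idx in overlapped_chunks(region, target.chunks)]
    for lock in locks:
        lock.acquire()
    try:
        target[region] = block
    finally:
        for lock in locks:
            lock.release()

# Reusing `arr` and the misaligned target from the previous sketch.
tasks = []
for idx in np.ndindex(*arr.numblocks):
    region = tuple(
        slice(sum(c[:i]), sum(c[:i]) + c[i]) for c, i in zip(arr.chunks, idx)
    )
    tasks.append(write_region(arr.blocks[idx], z_misaligned, region))
dask.compute(*tasks)
```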
cc @jakirkham in case you have ideas about this.
Top GitHub Comments
reproducer: https://gist.github.com/d-v-b/f4cf44e42f4e9d9cfa2d109d2ad26500
performance reports:
- 2021.06.0
- 2021.07.0
- main
If it’s not too much trouble, I’d love to see a before/after performance report for 2021.07.0. Did it get worse both with and without the final rechunk?

It’s true that it’s a bit silly to move the data between workers just for the purpose of concatenating and storing it. Having workers synchronize writing the data they already have is simpler in many ways. I think we just all want this rechunking to perform better, so we’re eager to understand why it’s not working 😄
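For reference, a minimal sketch of how one could capture such before/after reports with distributed.performance_report; the Client setup, array shape, chunk sizes, and filenames here are all illustrative assumptions rather than the actual reproducer:

```python
import dask.array as da
from dask.distributed import Client, performance_report

client = Client()  # local cluster just for illustration

arr = da.random.random((8192, 8192), chunks=(1024, 1024))

# "with the final rechunk": move data between workers so the dask chunks
# match the desired target chunks before storing
with performance_report(filename="store-with-rechunk.html"):
    arr.rechunk((2048, 2048)).to_zarr("with_rechunk.zarr")

# "without the final rechunk": store with the chunks the workers already hold
with performance_report(filename="store-without-rechunk.html"):
    arr.to_zarr("without_rechunk.zarr")
```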