Support for using zarr.sync.ProcessSynchronizer(path) with S3 as the path
Hi everyone,
I’ve been digging around to see if there’s already an existing way to use `zarr.sync.ProcessSynchronizer(path)` with S3 as the path, but no luck.
My scenario: I have a Lambda function that listens to S3 events and writes NetCDF files to a Zarr store (on S3); each Lambda invocation processes one NetCDF file.
As Lambda is a distributed system, 10 newly uploaded files will trigger 10 different processes that try to write to the Zarr store at pretty much the same time, and I experience some data corruption issues.
Using `zarr.sync.ProcessSynchronizer()` in `xarray.Dataset.to_zarr(synchronizer=...)` with a `DirectoryStore` seems to solve this write-consistency issue.
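For reference, here is a minimal sketch of that local setup, assuming the store is appended along a time dimension; the paths and the append dimension are illustrative, not taken from the original workflow.

```python
import xarray as xr
import zarr

# Hypothetical local paths -- ProcessSynchronizer relies on file locks on a
# shared filesystem, which is why it works for a DirectoryStore but has no
# S3-backed equivalent.
STORE_PATH = "/tmp/my_store.zarr"
LOCK_PATH = "/tmp/my_store.sync"

ds = xr.open_dataset("incoming_file.nc")  # one NetCDF file per process

synchronizer = zarr.sync.ProcessSynchronizer(LOCK_PATH)

# Append this file's data to the local Zarr store under the protection of
# the file-based locks.
ds.to_zarr(STORE_PATH, mode="a", append_dim="time", synchronizer=synchronizer)
```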
But storing the Zarr store on S3 is important to us, and a cloud-optimised format like Zarr should be able to fully support S3. So I wonder whether this is a bug, a missing feature, or something I just don’t know about yet.
Please advise.
Thanks everyone.
Hi @vietnguyengit and welcome! As far as I know, there is no way to provide the sort of synchronization you’re looking for using existing tools. The fact is that S3 is an “eventually consistent” store, meaning that an out-of-band mechanism is required to manage synchronization and locking.
Question: are the writes to overlapping chunks? Or can you guarantee that each write from lambda will not overlap with other writes? If so, you should be able to avoid the need for synchronization completely.
If not, something like https://github.com/zarr-developers/zarr-specs/issues/154 might be the solution. This is an area we are working on actively at the moment.
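As a rough illustration of the non-overlapping-writes suggestion above, the sketch below uses xarray’s region writes so that each Lambda invocation only touches its own chunk-aligned slice of a pre-initialised store; the bucket URL, the “time” dimension, and the file-to-index mapping are assumptions, not details from this issue.

```python
import xarray as xr

# Hypothetical store location; any fsspec-compatible URL would do.
STORE = "s3://my-bucket/my_store.zarr"

def write_one_file(nc_path: str, time_index: int) -> None:
    """Write one NetCDF file into its own non-overlapping region."""
    ds = xr.open_dataset(nc_path)
    # Region writes require dropping variables and coordinates that do not
    # span the region dimension; the target store must already exist with
    # the full "time" coordinate written out.
    ds = ds.drop_vars([v for v in ds.variables if "time" not in ds[v].dims])
    # As long as each slice covers whole chunks, no two writers ever touch
    # the same chunk, so no synchronizer is needed.
    ds.to_zarr(STORE, region={"time": slice(time_index, time_index + 1)})
```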
Thanks @tasansal. I see. We do have workflows with Dask, orchestrated with Prefect, to process a massive Zarr store of aggregated data from hundreds of thousands of NetCDF files, and that works fine. For example: http://ec2-3-105-15-240.ap-southeast-2.compute.amazonaws.com/ shows 30 years of SST data visualised on a webmap, with Zarr as the source for the tiles.
With Argo, we have a Zarr store built from a very large number of NetCDF files.
The experiments with Lambda were specifically to handle the case where some of the ingested files in that “big” Zarr store are revised (e.g. a data provider recalibrates their calculations), and we want the relevant regions of the Zarr store to be updated to reflect the new data.
Anyhow, we concluded that Lambda was not fit for purpose due to the consistency issues.
The ability to take locks when multiple processes write to the S3 Zarr store would help us decide on an “event-driven” architecture for handling “revised NetCDF files” from the data providers.
For now, scheduled flows bring fewer problems to deal with for our use cases.