Support for using zarr.sync.ProcessSynchronizer(path) with S3 as the path
Hi everyone,
I’ve been digging around to see if there’s already an existing way to use `zarr.sync.ProcessSynchronizer(path)` with S3 as the path, but no luck.
My scenario: I have a Lambda function that listens to S3 events and writes NetCDF files to a Zarr store (on S3); each Lambda invocation processes one NetCDF file.
As Lambda is a distributed system, 10 newly uploaded files will trigger 10 different processes that try to write to the Zarr store at pretty much the same time, and I experience some data corruption issues.
Using `zarr.sync.ProcessSynchronizer()` in `xarray.Dataset.to_zarr(synchronizer=...)` with a `DirectoryStore` seems to solve this write-consistency issue.
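For reference, here is a minimal sketch of that local setup, assuming the store is appended along a time dimension; the paths and the append dimension are illustrative, not taken from the original workflow.

```python
import xarray as xr
import zarr

# Hypothetical local paths -- ProcessSynchronizer relies on file locks on a
# shared filesystem, which is why it works for a DirectoryStore but has no
# S3-backed equivalent.
STORE_PATH = "/tmp/my_store.zarr"
LOCK_PATH = "/tmp/my_store.sync"

ds = xr.open_dataset("incoming_file.nc")  # one NetCDF file per process

synchronizer = zarr.sync.ProcessSynchronizer(LOCK_PATH)

# Append this file's data to the local Zarr store under the protection of
# the file-based locks.
ds.to_zarr(STORE_PATH, mode="a", append_dim="time", synchronizer=synchronizer)
```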
But storing the Zarr store on S3 is important to us, and a cloud-optimised format like Zarr should be able to fully support S3. So I wonder whether this is a bug, a missing feature, or something I just don’t know about yet.
Please advise.
Thanks everyone.
Hi @vietnguyengit and welcome! As far as I know, there is no way to provide the sort of synchronization you’re looking for using existing tools. The fact is that S3 is an “eventually consistent” store, meaning that an out-of-band mechanism is required to manage synchronization and locking.
Question: are the writes to overlapping chunks? Or can you guarantee that each write from lambda will not overlap with other writes? If so, you should be able to avoid the need for synchronization completely.
If not, something like https://github.com/zarr-developers/zarr-specs/issues/154 might be the solution. This is an area we are working on actively at the moment.
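As a rough illustration of the non-overlapping-writes suggestion above, the sketch below uses xarray’s region writes so that each Lambda invocation only touches its own chunk-aligned slice of a pre-initialised store; the bucket URL, the “time” dimension, and the file-to-index mapping are assumptions, not details from this issue.

```python
import xarray as xr

# Hypothetical store location; any fsspec-compatible URL would do.
STORE = "s3://my-bucket/my_store.zarr"

def write_one_file(nc_path: str, time_index: int) -> None:
    """Write one NetCDF file into its own non-overlapping region."""
    ds = xr.open_dataset(nc_path)
    # Region writes require dropping variables and coordinates that do not
    # span the region dimension; the target store must already exist with
    # the full "time" coordinate written out.
    ds = ds.drop_vars([v for v in ds.variables if "time" not in ds[v].dims])
    # As long as each slice covers whole chunks, no two writers ever touch
    # the same chunk, so no synchronizer is needed.
    ds.to_zarr(STORE, region={"time": slice(time_index, time_index + 1)})
```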
Thanks @tasansal. I see. We do have workflows with Dask, orchestrated with Prefect, to process a massive Zarr store of aggregated data from hundreds of thousands of NetCDF files, and that works fine. For example: http://ec2-3-105-15-240.ap-southeast-2.compute.amazonaws.com/ shows 30 years of SST data visualised on a webmap, with Zarr as the source for the tiles.
With Argo, we have a Zarr store built from a very large number of NetCDF files.
The experiments with Lambda were specifically to handle the case where some of the ingested files in that “big” Zarr store are revised (e.g. a data provider recalibrates their calculations), and we want the relevant regions of the Zarr store to be updated to reflect the new data.
Anyhow, we concluded that Lambda was not fit for purpose due to the consistency issues.
The ability to take locks when multiple processes write to the S3 Zarr store would help us decide on an “event-driven” architecture for handling “revised NetCDF files” from the data providers.
For now, scheduled flows bring fewer problems to deal with for our use cases.