
Support to use zarr.sync.ProcessSynchronizer(path) with S3 as path

See original GitHub issue

Hi everyone,

I’ve been digging around to see if there’s an existing way to use zarr.sync.ProcessSynchronizer(path) with an S3 path, but no luck so far.

My scenario: a Lambda function listens to S3 events and writes NetCDF files into a Zarr store (also on S3); each Lambda invocation processes one NetCDF file.

Because Lambda scales out, 10 newly uploaded files trigger 10 separate processes that all try to write to the Zarr store at roughly the same time, and I’m seeing data corruption.

Passing zarr.sync.ProcessSynchronizer() via xarray.Dataset.to_zarr(synchronizer=...) solves this write-consistency issue for a DirectoryStore.

But keeping the Zarr store on S3 is important to us, and a cloud-optimised format like Zarr should fully support S3. So I wonder whether this is a bug, a missing feature, or something I’ve simply overlooked.

Please advise.

Thanks everyone.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

1 reaction
rabernat commented, Nov 17, 2022

Hi @vietnguyengit and welcome! As far as I know, there is no way to provide the sort of synchronization you’re looking for using existing tools. The fact is that S3 is an “eventually consistent” store, meaning that an out-of-band mechanism is required to manage synchronization and locking.

Question: are the writes to overlapping chunks? Or can you guarantee that each write from lambda will not overlap with other writes? If so, you should be able to avoid the need for synchronization completely.

If not, something like https://github.com/zarr-developers/zarr-specs/issues/154 might be the solution. This is an area we are working on actively at the moment.

0 reactions
vietnguyengit commented, Nov 18, 2022

Thanks @tasansal, I see. We do have workflows built on Dask and orchestrated with Prefect that process a massive Zarr store aggregating data from hundreds of thousands of NetCDF files, and that works fine.

For example, the experiments with Lambda were specifically to handle the case where some of the files already ingested into that “big” Zarr store are revised (e.g. a data provider recalibrates their calculations) and we want the relevant regions of the Zarr store updated to reflect the new data.

Anyhow, we concluded that Lambda was not fit for purpose due to these consistency issues.

The ability to lock the S3 Zarr store while multiple processes write to it would let us adopt an event-driven architecture for ingesting revised NetCDF files from the data providers.

For now, scheduled flows are the less problematic option for our use case.


Top Results From Across the Web

  • Synchronization (zarr.sync) — zarr 2.13.3 documentation: Provides synchronization using file locks via the fasteners package. Parameters ... Path to a directory on a file system that is shared by...
  • Python multiprocessing writes get deadlocked on Linux systems: Hi everyone, Our team is trying to write to zarr arrays on AWS S3 using Python's built-in multiprocessing (mp) tools. Once we start...
  • Many netcdf to single zarr store using concurrent.futures - Data: ProcessSynchronizer ('zarr/sync_zarr.sync') def load_netcdf_write_zarr_1store(fname): ds = xr.open_dataset(os.path.join(datadir,fname)) fname ...
  • A step-by-step guide to synchronize data between Amazon S3 ...: Once support for replication of existing objects has been enabled for the AWS account, you will be able to use S3 Replication for...
  • zarr - Bountysource: LMDBStore(store_file) sync_dir = os.path.splitext(store_file)[0] + ".sync" synchronizer = zarr.ProcessSynchronizer(sync_dir) Blosc.use_threads = False ...
