
Proof of concept: CloudFilesStore


We currently rely 100% on fsspec and its implementations for accessing cloud storage (s3fs, gcsfs, adlfs). Cloud storage is complicated, and for debugging purposes, it could be useful to have an alternative. Since I met @william-silversmith a few years ago, I have been curious about CloudFiles:

https://github.com/seung-lab/cloud-files/

CloudFiles was developed to access files from object storage without ever touching disk. The goal was to reliably and rapidly access a petabyte of image data broken down into tens to hundreds of millions of files being accessed in parallel across thousands of cores. The predecessor of CloudFiles, CloudVolume.Storage, the core of which is retained here, has been used to process dozens of images, many of which were in the hundreds-of-terabytes range. Storage has reliably read and written tens of billions of files to date.

Highlights

  1. Fast file access with transparent threading and optional multiprocessing.
  2. Google Cloud Storage, Amazon S3, local filesystems, and arbitrary web servers making hybrid or multi-cloud easy.
  3. Robust to flaky network connections. Uses exponential random window retries to avoid network collisions on a large cluster. Validates md5 for GCS and S3.
  4. gzip, brotli, and zstd compression.
  5. Supports HTTP Range reads.
  6. Supports green threads, which are important for achieving maximum performance on virtualized servers.
  7. High efficiency transfers that avoid compression/decompression cycles.
  8. High speed gzip decompression using libdeflate (compared with zlib).
  9. Bundled CLI tool.
  10. Accepts iterator and generator input.
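CloudFiles' actual retry implementation lives inside the library; the idea behind "exponential random window retries" (item 3 above) can be sketched in a few lines of plain Python. Everything here is illustrative — the function name, the exception type, and the timing constants are assumptions, not CloudFiles API:

```python
import random
import time

def retry_with_random_window(fn, attempts=5, base=0.1,
                             rng=random.random, sleep=time.sleep):
    """Retry fn(), sleeping a random amount inside an exponentially
    growing window after each failure. Randomizing within the window
    spreads retries out so thousands of workers on a cluster do not
    hammer the object store in lockstep."""
    for attempt in range(attempts):
        try:
            return fn()
        except OSError:
            if attempt == attempts - 1:
                raise
            window = base * (2 ** attempt)   # 0.1s, 0.2s, 0.4s, ...
            sleep(rng() * window)            # pick a random point in the window

# usage: a fake flaky request that succeeds on the third try
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("transient network error")
    return "payload"

slept = []  # capture sleeps instead of actually waiting
result = retry_with_random_window(flaky, sleep=slept.append)
```

The key difference from plain exponential backoff is that the wait is drawn uniformly from `[0, window)` rather than being the full window, which decorrelates retries across workers.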

Today I coded up a quick CloudFiles-based store for Zarr:

from cloudfiles import CloudFiles

class CloudFilesMapper:
    
    def __init__(self, path, **kwargs):
        self.cf = CloudFiles(path, **kwargs)
        
    def clear(self):
        self.cf.delete(self.cf.list())
        
    def getitems(self, keys, on_error="none"):
        return {item['path']: item['content'] for item in self.cf.get(keys, raw=True)}
    
    def setitems(self, values_dict):
        self.cf.puts([(k, v) for k, v in values_dict.items()])
        
    def delitems(self, keys):
        self.cf.delete(keys)
        
    def __getitem__(self, key):
        return self.cf.get(key)
    
    def __setitem__(self, key, value):
        self.cf.put(key, value)
        
    def __iter__(self):
        for item in self.cf.list():
            yield item

    def __len__(self):
        raise NotImplementedError

    def __delitem__(self, key):
        self.cf.delete(key)

    def __contains__(self, key):
        return self.cf.exists(key)

    def listdir(self, key):
        for item in self.cf.list(key):
            # note: item.lstrip(key) would strip a *character set*, not the
            # prefix, so slice the prefix off explicitly instead
            yield item[len(key):].lstrip('/')
            
    def rmdir(self, prefix):
        self.cf.delete(self.cf.list(prefix=prefix))
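The mapper above only works against a real bucket, but the store surface it exposes (`getitems`/`setitems`/`delitems` plus the `MutableMapping` dunders) can be exercised with an in-memory stand-in. This `DictMapper` is my own illustrative analogue, not part of CloudFiles or Zarr:

```python
class DictMapper:
    """In-memory stand-in with the same surface as CloudFilesMapper,
    handy for unit-testing code that only needs the store interface."""

    def __init__(self):
        self._d = {}

    def getitems(self, keys, on_error="none"):
        # batched read: return only the keys that exist
        return {k: self._d[k] for k in keys if k in self._d}

    def setitems(self, values_dict):
        # batched write
        self._d.update(values_dict)

    def delitems(self, keys):
        for k in keys:
            self._d.pop(k, None)

    def __getitem__(self, key):
        return self._d[key]

    def __setitem__(self, key, value):
        self._d[key] = value

    def __delitem__(self, key):
        del self._d[key]

    def __contains__(self, key):
        return key in self._d

    def __iter__(self):
        return iter(self._d)

    def listdir(self, key):
        # yield names relative to the given prefix
        for k in self._d:
            if k.startswith(key + "/"):
                yield k[len(key) + 1:]

store = DictMapper()
store.setitems({"group/.zarray": b"{}", "group/0.0": b"\x00\x01"})
```

Anything that satisfies this interface can, in principle, be handed to Zarr as a store, which is what makes swapping fsspec for CloudFiles feasible at all.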

In my test with GCS, it works just fine with Zarr, Xarray, and Dask: https://nbviewer.jupyter.org/gist/rabernat/dde8b835bb7ef0590b6bf4034d5e0b2f

Distributed read performance was about 50% slower than gcsfs, but my benchmark is probably biased.

It might be useful to have the option to switch between the fsspec-based stores and this one. If folks are interested, we could think about adding this to zarr-python as some kind of optional alternative to fsspec.

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 3
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
martindurant commented, Jun 21, 2021

Note that fsspec uses asyncio to fetch multiple chunks concurrently, so this can greatly increase performance by setting each dask partition to be larger than the zarr chunksize.
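The concurrency point can be illustrated with a toy asyncio sketch: when many chunk requests are issued at once with `asyncio.gather`, the total wall time is roughly one round trip rather than one per chunk. The fetch here is simulated with `asyncio.sleep`; it is not fsspec code:

```python
import asyncio

async def fetch_chunk(key, delay=0.01):
    # stand-in for an async GET against object storage
    await asyncio.sleep(delay)
    return key, b"data-for-" + key.encode()

async def fetch_many(keys):
    # issue all requests concurrently; they overlap instead of
    # running back to back
    results = await asyncio.gather(*(fetch_chunk(k) for k in keys))
    return dict(results)

chunks = asyncio.run(fetch_many(["x/0.0", "x/0.1", "x/1.0"]))
```

This is why a dask partition spanning several zarr chunks can read faster than one partition per chunk: the store fetches the chunks of a partition concurrently.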

0 reactions
jakirkham commented, Jun 5, 2021

Somewhat related discussion about using multiple ThreadPoolExecutors per Dask Worker from earlier today here ( https://github.com/dask/distributed/issues/4655#issuecomment-854881294 )
