
Storage to S3, LIST operations

See original GitHub issue

I have a zarr file on S3 where I am storing data every ten minutes. I’m using zarr version 2.3.1 and s3fs to connect to the AWS bucket. The zarr file has the following structure:

/
└── yyyy
    └── mm
        └── dd
            └── HHMM
                ├── variable
                │   └── array (1, 1440, 1440) int32
                └── variable
                    └── array (1, 1440, 1440) int32

As my zarr file grows I’m noticing an increase in costs due to LIST operations. Digging into the log files, I noticed that creating a zarr array with zarr.create() on S3 involves listing all the groups in the zarr file. LIST operations on S3 are expensive, and the number of requests grows with the number of groups in the zarr file, so I’m facing unsustainable costs from (unnecessary?) LIST operations. Here is an excerpt of the logs:

urllib3.util.retry DEBUG Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0, status=None)
urllib3.connectionpool DEBUG https://amazonaws.com:443 "GET /?list-type=2&prefix=file.zarr%2F2019%2F06%2F12%2F&delimiter=%2F&encoding-type=url HTTP/1.1" 200 None
urllib3.util.retry DEBUG Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0, status=None)
urllib3.connectionpool DEBUG https://amazonaws.com:443 "GET /?list-type=2&prefix=file.zarr%2F2019%2F06%2F12%2F1010%2F&delimiter=%2F&encoding-type=url HTTP/1.1" 200 None
urllib3.util.retry DEBUG Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0, status=None)
urllib3.connectionpool DEBUG https://amazonaws.com:443 "GET /?list-type=2&prefix=file.zarr%2F2019%2F06%2F12%2F1010%2Fcth%2F&delimiter=%2F&encoding-type=url HTTP/1.1" 200 None
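For context, the write pattern being described might look roughly like the sketch below. This is not code from the issue: the bucket name is a placeholder, the store is assumed to be an s3fs.S3Map, and the group path and variable name simply mirror the ones visible in the logs.

import numpy as np
import s3fs
import zarr

# Map the bucket prefix to a MutableMapping store (credentials come from the environment).
fs = s3fs.S3FileSystem()
store = s3fs.S3Map(root="my-bucket/file.zarr", s3=fs, check=False)

# One group per ten-minute timestamp, e.g. 2019/06/12/1010.
root = zarr.group(store=store)
grp = root.require_group("2019/06/12/1010")

# Array creation is where the per-group LIST requests show up: zarr checks
# whether the parent groups exist and whether the target path is already taken.
arr = grp.create_dataset(
    "cth",
    shape=(1, 1440, 1440),
    chunks=(1, 1440, 1440),
    dtype="int32",
)
arr[:] = np.zeros((1, 1440, 1440), dtype="int32")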

Is there a workaround that avoids listing all the groups when pushing an array to a new group? Or is there another way of saving an array to zarr that doesn’t require checking whether the array already exists?

Thanks

Cedric

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
alimanfoo commented, Jul 25, 2019

Another option might be to add some options to disable the various checks during array creation, which are the cause of all the listing. There are two main checks we could disable: the first verifies that the parent group exists (and creates it if not); the second verifies whether an array or group already exists at the requested path for the new array. Your code very probably does not need these checks, so it could be reasonable to skip them.

So, e.g., a concrete solution here could be to add check_parent and check_exists keyword arguments to the array creation functions, which would be passed through to init_array. These would default to True, which gives the current behaviour, but the user could set them to False, in which case the checks would be skipped.
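To make the proposal concrete, a sketch of what a call might look like if these keyword arguments existed. Note that check_parent and check_exists are the proposed arguments, not part of the current zarr API, and store/path are placeholders.

import zarr

# Hypothetical usage of the *proposed* keyword arguments (not available in zarr 2.3.x).
arr = zarr.create(
    shape=(1, 1440, 1440),
    chunks=(1, 1440, 1440),
    dtype="int32",
    store=store,          # e.g. an s3fs.S3Map pointing at the bucket
    path="2019/06/12/1010/cth",
    check_parent=False,   # proposed: skip the "does the parent group exist?" listing
    check_exists=False,   # proposed: skip the "is something already at this path?" listing
)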

0 reactions
Cedric-LG commented, Aug 9, 2019

Hi @alimanfoo. We’ve looked further into the problem and found that the main culprit was that we were using overwrite=True. This caused a listing of all the arrays in the zarr file on every save, as zarr was checking whether the array already existed. Setting it to False greatly reduced our costs. Nonetheless, the changes suggested above would reduce the cost further, so I’ve implemented them and opened pull request #464. However, I haven’t written any unit tests yet, because version 2.3.3.dev4 is not working for me: it raises an error in the indexing when iterating over the chunks.
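In other words (a sketch assuming zarr.create-style calls; store and path are placeholders), the expensive pattern and the fix differ only in the overwrite flag:

import zarr

# Expensive: overwrite=True makes zarr look for (and delete) anything already
# stored under the path, which triggers the listing described above.
arr = zarr.create(
    shape=(1, 1440, 1440),
    dtype="int32",
    store=store,
    path="2019/06/12/1010/cth",
    overwrite=True,
)

# Cheaper: with overwrite=False (the default) no pre-existing data is searched
# for and removed, so far fewer LIST requests are issued.
arr = zarr.create(
    shape=(1, 1440, 1440),
    dtype="int32",
    store=store,
    path="2019/06/12/1010/cth",
    overwrite=False,
)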
