Storage to S3, LIST operations
I have a zarr file on S3 to which I write data every ten minutes. I’m using zarr version 2.3.1 and s3fs to connect to the AWS bucket. The zarr file has the following structure:
/
└── yyyy
    └── mm
        └── dd
            └── HHMM
                ├── variable
                │   └── array (1, 1440, 1440) int32
                └── variable
                    └── array (1, 1440, 1440) int32
As my zarr file grows, I’m noticing an increase in costs due to LIST operations. Digging into the log files, I noticed that creating a zarr array with zarr.create() on S3 involves listing all the groups in the zarr file. A LIST operation on S3 is expensive, and the number of requests grows with the number of groups in the zarr file, so the cost of these (unnecessary?) LIST operations is becoming unsustainable. See this excerpt from the logs:
urllib3.util.retry DEBUG Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0, status=None)
urllib3.connectionpool DEBUG https://amazonaws.com:443 "GET /?list-type=2&prefix=file.zarr%2F2019%2F06%2F12%2F&delimiter=%2F&encoding-type=url HTTP/1.1" 200 None
urllib3.util.retry DEBUG Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0, status=None)
urllib3.connectionpool DEBUG https://amazonaws.com:443 "GET /?list-type=2&prefix=file.zarr%2F2019%2F06%2F12%2F1010%2F&delimiter=%2F&encoding-type=url HTTP/1.1" 200 None
urllib3.util.retry DEBUG Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0, status=None)
urllib3.connectionpool DEBUG https://amazonaws.com:443 "GET /?list-type=2&prefix=file.zarr%2F2019%2F06%2F12%2F1010%2Fcth%2F&delimiter=%2F&encoding-type=url HTTP/1.1" 200 None
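For anyone who wants to quantify this from their own logs, here is a minimal sketch that counts the LIST requests per prefix. It assumes the urllib3 DEBUG log format shown above and treats any GET carrying list-type=2 as an S3 list call; the function name list_prefixes and the sample string are made up for illustration.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Matches the request part of a urllib3 connectionpool DEBUG line.
# Both straight and curly quotes are accepted, since copied logs often
# have their quotes converted.
REQUEST_RE = re.compile(r'["“](?P<method>[A-Z]+) (?P<target>\S+) HTTP/1\.1["”]')


def list_prefixes(log_text):
    """Count S3 list requests (list-type=2) per requested prefix."""
    counts = Counter()
    for match in REQUEST_RE.finditer(log_text):
        if match["method"] != "GET":
            continue
        # parse_qs also URL-decodes the values, so %2F becomes "/".
        query = parse_qs(urlsplit(match["target"]).query)
        if query.get("list-type") == ["2"]:
            counts[query.get("prefix", [""])[0]] += 1
    return counts


# Sample lines copied from the log excerpt in this issue.
sample = "\n".join([
    'urllib3.connectionpool DEBUG https://amazonaws.com:443 "GET /?list-type=2'
    '&prefix=file.zarr%2F2019%2F06%2F12%2F&delimiter=%2F&encoding-type=url HTTP/1.1" 200 None',
    'urllib3.connectionpool DEBUG https://amazonaws.com:443 "GET /?list-type=2'
    '&prefix=file.zarr%2F2019%2F06%2F12%2F1010%2F&delimiter=%2F&encoding-type=url HTTP/1.1" 200 None',
    'urllib3.connectionpool DEBUG https://amazonaws.com:443 "GET /?list-type=2'
    '&prefix=file.zarr%2F2019%2F06%2F12%2F1010%2Fcth%2F&delimiter=%2F&encoding-type=url HTTP/1.1" 200 None',
])

for prefix, n in sorted(list_prefixes(sample).items()):
    print(n, prefix)
```

Running this over the log of a single write should show how many list calls one array creation triggers and which levels of the hierarchy they touch.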
Is there a workaround for this that doesn’t require listing all the groups when pushing an array to a new group? Or is there another way of saving an array to zarr that doesn’t require that the array exists?
Thanks
Cedric
Issue Analytics
- Created 4 years ago
- Comments:7 (4 by maintainers)
So, e.g., a concrete solution here could be to add check_parent and check_exists keyword arguments to the array creation functions, which would be passed through to init_array. These would default to True, which gives the current behaviour, but the user could set them to False, in which case the checks would be skipped.
Hi @alimanfoo. We’ve looked further into the problem and discovered that the main cause was that we were using overwrite=True. This caused a listing of all the arrays in the zarr file while saving, as it was checking whether the array already existed. Setting this to False greatly reduced our costs. Nonetheless, the changes suggested above would further reduce the cost, so I’ve implemented them and created pull request #464. However, I haven’t created any unit tests yet, as version 2.3.3.dev4 is not working for me: it gives an error in the indexing when it iterates over the chunks.
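To make the effect of these flags easier to picture, here is a toy model using only the standard library. It is not zarr's implementation: CountingStore, init_array_sketch and run are hypothetical names, and the assumption that overwrite needs a key listing while the existence and parent checks need only key lookups is a simplification for illustration. The overwrite, check_exists and check_parent flags mirror the ones discussed in this thread.

```python
from collections.abc import MutableMapping


class CountingStore(MutableMapping):
    """Dict-backed store that counts LIST-like and lookup-like calls."""

    def __init__(self):
        self._data = {}
        self.list_calls = 0    # stands in for S3 LIST requests
        self.lookup_calls = 0  # stands in for per-key existence checks

    def __getitem__(self, key):
        self.lookup_calls += 1
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

    def __contains__(self, key):
        self.lookup_calls += 1
        return key in self._data

    def __iter__(self):
        self.list_calls += 1
        return iter(list(self._data))

    def __len__(self):
        return len(self._data)


def init_array_sketch(store, path, meta, overwrite=False,
                      check_exists=True, check_parent=True):
    """Hypothetical stand-in for array initialisation (not zarr's code)."""
    key = path + "/.zarray"
    if overwrite:
        # Assumption: clearing everything under the path needs a key listing.
        for old in [k for k in store if k.startswith(path + "/")]:
            del store[old]
    elif check_exists and key in store:
        raise ValueError(f"array already exists at {path!r}")
    if check_parent:
        # Make sure every ancestor group has a metadata key.
        parts = path.split("/")[:-1]
        for i in range(1, len(parts) + 1):
            group_key = "/".join(parts[:i]) + "/.zgroup"
            if group_key not in store:
                store[group_key] = b"{}"
    store[key] = meta


def run(**flags):
    """Create one array in a fresh store; return (list_calls, lookup_calls)."""
    store = CountingStore()
    init_array_sketch(store, "2019/06/12/1010/cth", b"{}", **flags)
    return store.list_calls, store.lookup_calls


print("overwrite=True:       ", run(overwrite=True))
print("overwrite=False:      ", run(overwrite=False))
print("all checks turned off:", run(check_exists=False, check_parent=False))
```

In this model only the overwrite path touches the listing counter, and turning off both checks removes the remaining lookups as well. Whether the same holds for a real store depends on the zarr and s3fs versions in use, so the request logs remain the thing to check.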