Dataset: Enable access to the same s3 bucket under different hosts
Proposal Summary
To manage datasets with ClearML, I need a config option that specifies the host under which a bucket can be accessed, so that the host does not have to be part of the object path, for example when using dataset.add_external_files(...).
E.g. I need to be able to call my_dataset.add_external_files(source_url="s3://training_data/cifar/image1.jpg")
instead of my_dataset.add_external_files(source_url="s3://miniohost:9000/training_data/cifar/image1.jpg").
Possible solution
This could be achieved by either a new parameter on a per-bucket level (e.g. sdk.aws.s3.credentials.require_explicit_hostname with a default of true) or by setting the hostname for all buckets (e.g. sdk.aws.s3.host).
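To make the opt-in behaviour concrete, here is a minimal sketch in plain Python of how such a resolution step might work. It is not ClearML code: the function name resolve_s3_url and its parameters are hypothetical, the two options it mirrors are only the ones proposed above, and treating "netloc contains a port" as the sign of an explicit host is an assumption made for illustration.

```python
from urllib.parse import urlsplit


def resolve_s3_url(url, default_host=None, require_explicit_hostname=True):
    """Insert default_host into a host-less s3:// URL when the opt-in is enabled.

    Hypothetical helper: default_host plays the role of the proposed
    sdk.aws.s3.host, require_explicit_hostname the role of the proposed
    per-bucket flag (default True, i.e. current behaviour is kept).
    """
    parts = urlsplit(url)
    if parts.scheme != "s3":
        return url  # only s3:// URLs are rewritten

    # Assumption for this sketch: an explicit host always carries a port,
    # e.g. "miniohost:9000"; anything else is taken to be the bucket name.
    has_explicit_host = ":" in parts.netloc

    if has_explicit_host or require_explicit_hostname or not default_host:
        return url  # nothing to do, or the opt-in is disabled

    # parts.netloc is the bucket here, parts.path the object key
    return f"s3://{default_host}/{parts.netloc}{parts.path}"


# Each agent environment would supply its own host in its config.
resolved = resolve_s3_url(
    "s3://training_data/cifar/image1.jpg",
    default_host="miniohost:9000",
    require_explicit_hostname=False,
)
print(resolved)
```

With the default require_explicit_hostname=True the URL is returned unchanged, so existing setups that rely on same-named buckets on AWS and a local server would not be affected.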
Motivation
Experiments need to run transparently on different agent setups. When agents run in different environments, the same data may be cached closer to each agent or may only be available at special endpoints that differ from environment to environment. I need to be able to control the requested host in each agent's config instead of having it hardcoded in the dataset.
Related Discussion
This (or a similar) feature was requested previously, but the request was closed because it would prevent a bucket with the same name from existing on both e.g. AWS and a local minio server. That would, I guess, be a problem if those buckets had different contents. While this is true, being able to set the host per environment is, as explained above, crucial for the transparent execution of experiments on different agents. I would recommend not introducing this as a breaking change, but making it opt-in instead.
Issue Analytics
- State:
- Created 10 months ago
- Comments:7 (3 by maintainers)
@john-zielke-snkeos Just letting you know that we added documentation for path substitution here: https://clear.ml/docs/latest/docs/integrations/storage#path-substitution
Let me know if that’s clear enough 😃
Thanks for your feedback!
Exactly, so far looks like it should work for the use case. Thank you!