Dataset: Enable access to the same s3 bucket under different hosts

Proposal Summary

To manage datasets with ClearML, I need a config option that specifies the host under which a bucket can be accessed, so that the host does not have to be part of the object path, for example when calling dataset.add_external_files(...). E.g. I need to be able to call my_dataset.add_external_files(source_url="s3://training_data/cifar/image1.jpg") instead of my_dataset.add_external_files(source_url="s3://miniohost:9000/training_data/cifar/image1.jpg").

Possible solution

This could be achieved either by a new per-bucket parameter (e.g. sdk.aws.s3.credentials.require_explicit_hostname with a default of true) or by setting the hostname for all buckets (e.g. sdk.aws.s3.host).
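
A minimal sketch of the second option, to make the intended behaviour concrete. This is not ClearML code: resolve_s3_url and its default_host argument are hypothetical names standing in for a lookup of the proposed sdk.aws.s3.host setting, and treating a "host:port" first segment as an explicit host is an assumption of this sketch.

```python
from typing import Optional


def resolve_s3_url(url: str, default_host: Optional[str]) -> str:
    """Insert default_host into a host-less s3:// URL (illustration only)."""
    prefix = "s3://"
    # Leave non-S3 URLs alone, and do nothing when no host is configured.
    if not url.startswith(prefix) or not default_host:
        return url
    rest = url[len(prefix):]
    first_segment = rest.split("/", 1)[0]
    # Assumption: a first segment containing ":" is an explicit host:port,
    # so the URL is already fully qualified and is returned unchanged.
    if ":" in first_segment:
        return url
    return f"{prefix}{default_host}/{rest}"


# Each agent would pass the host from its own config (hypothetical value here).
print(resolve_s3_url("s3://training_data/cifar/image1.jpg", "miniohost:9000"))
```

With such a rule the dataset stores only the host-less path, and each agent decides which endpoint it is resolved against.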

Motivation

Experiments need to run transparently on different agent setups. When agents run in different environments, the same data may be cached closer to the agents or be available only at special endpoints that differ from environment to environment. I need to be able to control the requested host in each agent's config rather than have it hardcoded in the dataset.

Related Discussion

This (or a similar) feature was requested previously but closed, because it would prevent buckets with the same name from existing on e.g. AWS and a local minio server. I guess this would be a problem if those buckets had different contents. While this is true, being able to omit the host is crucial in the situations described above for the transparent execution of experiments on different agents. I would recommend not introducing this as a breaking change, but making it opt-in instead.

Issue Analytics

  • State: closed
  • Created 10 months ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

erezalg commented, Dec 5, 2022 (1 reaction)

@john-zielke-snkeos Just letting you know that we added documentation for path substitution here: https://clear.ml/docs/latest/docs/integrations/storage#path-substitution

Let me know if that’s clear enough 😃

Thanks for your feedback!

john-zielke-snkeos commented, Dec 5, 2022 (0 reactions)

Exactly, so far looks like it should work for the use case. Thank you!
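
As an illustration of the path substitution idea referred to in the comments above, the following sketch rewrites a registered URL prefix to an environment-specific one. It is not ClearML's implementation: substitute_prefix and the rule pairs are hypothetical, and the actual configuration keys are the ones described in the linked documentation.

```python
from typing import List, Tuple


def substitute_prefix(url: str, rules: List[Tuple[str, str]]) -> str:
    """Replace the first matching registered prefix with its local prefix."""
    for registered_prefix, local_prefix in rules:
        if url.startswith(registered_prefix):
            # Keep everything after the registered prefix unchanged.
            return local_prefix + url[len(registered_prefix):]
    # No rule matched: the URL is used as registered.
    return url


# Hypothetical per-agent rule: the dataset registers host-less URLs and this
# agent maps them to its nearby minio endpoint.
agent_rules = [("s3://training_data/", "s3://miniohost:9000/training_data/")]
print(substitute_prefix("s3://training_data/cifar/image1.jpg", agent_rules))
```

Each agent environment would carry its own rules, so the URLs stored in the dataset stay the same everywhere.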
