
Examples to use a large remote dataset (S3 or MinIO)


Is your feature request related to a problem? Please describe.

I want to use a remote dataset hosted in S3 or MinIO. Do you have any examples or guidance? Most examples on the Ludwig website seem to use built-in datasets or local files, for instance:

# Built-in dataset example from the docs
from ludwig.datasets import mushroom_edibility
dataset_df = mushroom_edibility.load()

# Local file example from the docs
import pandas as pd
dataset_df = pd.read_csv(dataset_path)  # dataset_path points at a local CSV
  1. I tried import dask.dataframe as dd; dataset_df = dd.read_csv('s3://bucket/myfiles.*.csv'), but I noticed I have to handle s3fs myself (it is required by Dask). Is this the right way, or is there an easier way? (A sketch of what I mean follows after this list.)
  2. I also noticed that the dataset argument accepts a string. I am using MinIO for testing; is MinIO supported here? I want to customize the endpoint and signature, the way I do with the AWS CLI:
aws configure set default.s3.signature_version s3v4
aws --endpoint-url http://minio-service:9090 s3 ls
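For reference, here is a minimal sketch of the Dask/s3fs route mentioned in item 1, pointed at MinIO. It assumes s3fs is installed; the bucket, keys, and endpoint below are placeholders.

import dask.dataframe as dd

# storage_options is forwarded by Dask to s3fs, which in turn passes
# client_kwargs to the underlying boto3 client.
storage_options = {
    "key": "accessKey",              # placeholder access key
    "secret": "mySecretKey",         # placeholder secret key
    "client_kwargs": {"endpoint_url": "http://minio-service:9090"},  # MinIO endpoint
}

dataset_df = dd.read_csv("s3://bucket/myfiles.*.csv", storage_options=storage_options)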

Describe the use case
Use a remote dataset.

Describe the solution you’d like
Provide an easy-to-use wrapper.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 8

Top GitHub Comments

1 reaction
tgaddair commented, Sep 15, 2022

Hey @Jeffwan, yes we support s3 / minio and any remote object storage supported by fsspec.

Reading the data from minio with Dask is one way to do it. This is the easiest way to go if your environment is not configured to automatically connect to the remote storage backend. We provide a wrapper ludwig.utils.data_utils.use_credentials that simplifies setting credentials:

import dask.dataframe as dd
from ludwig.utils.data_utils import use_credentials

creds = {
    'client_kwargs': {
        'aws_access_key_id': 'accessKey',
        'aws_secret_access_key': 'mySecretKey',
        'endpoint_url': 'http://minio-service:9000'
    }
}

with use_credentials(creds):
    df = dd.read_csv("s3://...")

The other option is to pass a string. This also works with MinIO, but it assumes that your environment is already set up to connect to s3 / minio without specifying any additional credentials. However, the endpoint_url makes this somewhat tricky with s3fs (see: https://github.com/fsspec/s3fs/issues/432). So for now I recommend providing the credentials explicitly and reading with Dask.
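For reference, a minimal sketch of what the string-path approach looks like under those assumptions (credentials already present in the environment, s3fs installed; config.yaml and the bucket path are placeholders, not from this thread):

from ludwig.api import LudwigModel

# The dataset is passed as a plain URI string; fsspec/s3fs resolves it using
# credentials already present in the environment (e.g. AWS_ACCESS_KEY_ID,
# AWS_SECRET_ACCESS_KEY).
model = LudwigModel(config="config.yaml")
train_stats, preprocessed_data, output_directory = model.train(
    dataset="s3://my-bucket/train.csv"
)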

One thing we could do, if it would make things easier, is allow you to provide credentials (either path to credentials file or directly) within the Ludwig config, similar to how we let the user specify the cache credentials:

https://ludwig-ai.github.io/ludwig-docs/0.5/configuration/backend/

Let me know if that would help simplify things.

One last thing to note: it is true that s3fs needs to be installed to connect to s3 / minio. We decided against including it and other optional libraries in the requirements to save space, but let me know if it would be preferable to bake them into the Docker image.

0 reactions
Jeffwan commented, Oct 5, 2022

I can confirm the following works fine for my case. The only tricky part is that I have to pass the credentials through environment variables instead of client_kwargs, to avoid the error noted in the snippet below (a fuller sketch follows it).

s3_creds = {
    "s3": {
        "client_kwargs": {
            "endpoint_url": object_storage_endpoint,
            # Do not pass the access key and secret key here: client_kwargs is
            # forwarded to boto3.client, so configuring them here raises
            # TypeError: create_client() got multiple values for keyword argument 'aws_access_key_id'.
            # Let the client read them from the environment instead.
            # "aws_access_key_id": os.environ['AWS_ACCESS_KEY_ID'],
            # "aws_secret_access_key": os.environ['AWS_SECRET_ACCESS_KEY'],
        }
    }
}

with use_credentials(s3_creds):
    ...  # my logic
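For completeness, a self-contained sketch of how that looks end to end, with the credentials supplied via environment variables (the key values, endpoint, and bucket path are placeholders, not from my actual setup):

import os

import dask.dataframe as dd
from ludwig.utils.data_utils import use_credentials

# boto3 picks the credentials up from the environment, so they are not
# repeated inside client_kwargs and the duplicate-keyword error is avoided.
os.environ["AWS_ACCESS_KEY_ID"] = "accessKey"        # placeholder
os.environ["AWS_SECRET_ACCESS_KEY"] = "mySecretKey"  # placeholder

s3_creds = {"s3": {"client_kwargs": {"endpoint_url": "http://minio-service:9000"}}}

with use_credentials(s3_creds):
    df = dd.read_csv("s3://my-bucket/train.*.csv")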

If you add support for the following config in the future, it would save us the extra effort of setting up s3_creds ourselves. This is not a blocking issue, so I will close it now.

backend:
    credentials:
        s3:
            client_kwargs:
                endpoint_url: {{.AWS_ENDPOINT_URL}}
