Examples to use large remote datasets (S3 or MinIO)
See original GitHub issue
Is your feature request related to a problem? Please describe.
I want to use a remote dataset hosted in S3 or MinIO. Do you have any examples? It seems most examples on the Ludwig website use built-in datasets or local files. Do you have any guidance on using S3 or MinIO? For example, from the website:
```python
from ludwig.datasets import mushroom_edibility
dataset_df = mushroom_edibility.load()
```
```python
import pandas as pd
dataset_df = pd.read_csv(dataset_path)
```
- I personally tried `import dask.dataframe as dd; dataset_df = dd.read_csv('s3://bucket/myfiles.*.csv')`, but I notice I have to handle `s3fs` (required by Dask). Is this the right way, or is there an easier way?
- I also notice that the dataset argument accepts a string. I am using MinIO for testing; is MinIO supported here? I want to customize the `endpoint` and `signature`, the way I do with the AWS CLI (a Dask sketch of this follows the commands below):

```
aws configure set default.s3.signature_version s3v4
aws --endpoint-url http://minio-service:9090 s3 ls
```
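For illustration, one way to get a custom endpoint and signature version through Dask is to pass `storage_options`, which s3fs forwards to the underlying S3 client. The bucket name, endpoint URL, and access keys below are placeholders, not values from this issue:

```python
# Sketch: read a CSV from MinIO with Dask by forwarding s3fs options.
# Bucket name, endpoint URL, and keys are placeholders.
import dask.dataframe as dd

storage_options = {
    "key": "minio-access-key",        # placeholder access key
    "secret": "minio-secret-key",     # placeholder secret key
    "client_kwargs": {"endpoint_url": "http://minio-service:9090"},
    "config_kwargs": {"signature_version": "s3v4"},  # equivalent of the CLI setting above
}

dataset_df = dd.read_csv("s3://bucket/myfiles.*.csv", storage_options=storage_options)
```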
Describe the use case
Use a remote dataset.
Describe the solution you'd like
Provide an easy-to-use wrapper.
Issue Analytics
- Created a year ago
- Comments: 8
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Hey @Jeffwan, yes we support s3 / minio and any remote object storage supported by fsspec.
Reading the data from minio with Dask is one way to do it. This is the easiest way to go if your environment is not configured to automatically connect to the remote storage backend. We provide a wrapper, `ludwig.utils.data_utils.use_credentials`, that simplifies setting credentials:
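As a minimal sketch, assuming `use_credentials` is used as a context manager and that the credentials dict follows the same per-protocol layout as the cache credentials described in the docs linked below (the exact dict keys, endpoint URL, and access keys here are placeholders, not confirmed by this thread):

```python
# Hedged sketch: set MinIO credentials for fsspec/s3fs via Ludwig's wrapper,
# then read the data with Dask. Dict layout, endpoint, and keys are assumptions.
import dask.dataframe as dd
from ludwig.utils.data_utils import use_credentials

creds = {
    "s3": {
        "client_kwargs": {
            "endpoint_url": "http://minio-service:9090",  # placeholder MinIO endpoint
            "aws_access_key_id": "minio-access-key",      # placeholder
            "aws_secret_access_key": "minio-secret-key",  # placeholder
        }
    }
}

with use_credentials(creds):
    dataset_df = dd.read_csv("s3://bucket/myfiles.*.csv")
```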
The other option is to pass a string path. This also works with Minio, but it assumes that your environment is already set up to connect to s3 / minio without specifying any additional credentials. However, the `endpoint_url` makes this somewhat tricky with s3fs (see https://github.com/fsspec/s3fs/issues/432), so for now I recommend providing the credentials explicitly and reading with Dask.

One thing we could do, if it would make things easier, is allow you to provide credentials (either a path to a credentials file or the credentials directly) within the Ludwig config, similar to how we let the user specify the cache credentials:
https://ludwig-ai.github.io/ludwig-docs/0.5/configuration/backend/
Let me know if that would help simplify things.
One last thing to note: it is true that s3fs needs to be installed to connect to s3 / minio. We decided against including this and other libraries in the requirements to save space, but let me know if it would be preferable to bake them into the Docker image.
I can confirm the following way works fine for my case. The only tricky thing is that I had to use credential ENV variables instead of client_kwargs to overcome the following issue.
If you add support for the following config in the future, it would save us the additional effort of configuring s3_creds. This is not a blocking issue, so I will close it now.
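For illustration, the environment-variable workaround described above might look like the sketch below; the endpoint comes from the commands earlier in the thread, while the key values are placeholders:

```python
# Sketch of the workaround: supply credentials through standard botocore
# environment variables instead of client_kwargs, keeping only the endpoint
# in storage_options. Key values are placeholders.
import os
import dask.dataframe as dd

os.environ["AWS_ACCESS_KEY_ID"] = "minio-access-key"      # placeholder
os.environ["AWS_SECRET_ACCESS_KEY"] = "minio-secret-key"  # placeholder

dataset_df = dd.read_csv(
    "s3://bucket/myfiles.*.csv",
    storage_options={"client_kwargs": {"endpoint_url": "http://minio-service:9090"}},
)
```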