
Loading Data From S3 Path in Sagemaker

See original GitHub issue

In SageMaker I'm trying to load the dataset from an S3 path as follows:

```python
from datasets import load_dataset

train_path = 's3://xxxxxxxxxx/xxxxxxxxxx/train.csv'
valid_path = 's3://xxxxxxxxxx/xxxxxxxxxx/validation.csv'
test_path = 's3://xxxxxxxxxx/xxxxxxxxxx/test.csv'

data_files = {}
data_files["train"] = train_path
data_files["validation"] = valid_path
data_files["test"] = test_path
extension = train_path.split(".")[-1]
datasets = load_dataset(extension, data_files=data_files, s3_enabled=True)
print(datasets)
```

I get the following error:

```
algo-1-7plil_1 |   File "main.py", line 21, in <module>
algo-1-7plil_1 |     datasets = load_dataset(extension, data_files=data_files)
algo-1-7plil_1 |   File "/opt/conda/lib/python3.6/site-packages/datasets/load.py", line 603, in load_dataset
algo-1-7plil_1 |     **config_kwargs,
algo-1-7plil_1 |   File "/opt/conda/lib/python3.6/site-packages/datasets/builder.py", line 155, in __init__
algo-1-7plil_1 |     **config_kwargs,
algo-1-7plil_1 |   File "/opt/conda/lib/python3.6/site-packages/datasets/builder.py", line 305, in _create_builder_config
algo-1-7plil_1 |     m.update(str(os.path.getmtime(data_file)))
algo-1-7plil_1 |   File "/opt/conda/lib/python3.6/genericpath.py", line 55, in getmtime
algo-1-7plil_1 |     return os.stat(filename).st_mtime
algo-1-7plil_1 | FileNotFoundError: [Errno 2] No such file or directory: 's3://lsmv-sagemaker/pubmedbert/test.csv'
```
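The last two frames show the root cause: `datasets` hashes the file's modification time via `os.path.getmtime`, and the underlying `os.stat` treats the `s3://` URL as a local path that does not exist. A minimal stdlib-only reproduction (the bucket/key below are placeholders, not the issue's real path):

```python
import os

# os.stat (which getmtime wraps) has no notion of S3; it interprets the URL
# as a local relative path, so it fails with ENOENT exactly as in the traceback.
try:
    os.path.getmtime("s3://some-bucket/some-key/test.csv")
except FileNotFoundError as err:
    print(err.errno)  # 2, i.e. "No such file or directory"
```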

But when I try with pandas, it is able to load from S3.

Does the datasets library support S3 paths for loading?
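One common workaround, assuming `boto3` (the standard AWS SDK) is available in the container, is to download the files to local disk first and pass local paths to `load_dataset`. A minimal sketch; the bucket and key names are placeholders:

```python
from urllib.parse import urlparse

def split_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key) for use with boto3."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError("not an S3 URI: " + uri)
    return parsed.netloc, parsed.path.lstrip("/")

# Download the file locally, then hand the local path to load_dataset:
# import boto3
# bucket, key = split_s3_uri("s3://my-bucket/data/train.csv")
# boto3.client("s3").download_file(bucket, key, "train.csv")
# datasets = load_dataset("csv", data_files={"train": "train.csv"})
```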

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 16 (7 by maintainers)

Top GitHub Comments

1 reaction
philschmid commented, Dec 23, 2020

@dorlavie are you using SageMaker for training too? Then you could use an S3 URI, for example `s3://my-bucket/my-training-data`, and pass it to the `.fit()` function when you start the SageMaker training job. SageMaker would then download the data from S3 into the training runtime, and you could load it from disk.

SageMaker: start the training job

```python
pytorch_estimator.fit({'train': 's3://my-bucket/my-training-data',
                       'eval': 's3://my-bucket/my-evaluation-data'})
```

In the train.py script:

```python
import os

from datasets import load_from_disk

train_dataset = load_from_disk(os.environ['SM_CHANNEL_TRAIN'])
```
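For context on where `SM_CHANNEL_TRAIN` comes from: each channel name passed to `.fit()` surfaces inside the training container as an environment variable `SM_CHANNEL_<NAME>` pointing at the downloaded data directory. A small sketch of that naming convention (the channel names are illustrative):

```python
def channel_env_var(channel_name):
    """Map a SageMaker .fit() channel name to its in-container env var."""
    return "SM_CHANNEL_" + channel_name.upper()

print(channel_env_var("train"))  # SM_CHANNEL_TRAIN
print(channel_env_var("eval"))   # SM_CHANNEL_EVAL
```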

I have created an example of how to use transformers and datasets with sagemaker. https://github.com/philschmid/huggingface-sagemaker-example/tree/main/03_huggingface_sagemaker_trainer_with_data_from_s3

The example contains a jupyter notebook sagemaker-example.ipynb and an src/ folder. The sagemaker-example is a jupyter notebook that is used to create the training job on AWS Sagemaker. The src/ folder contains the train.py, our training script, and requirements.txt for additional dependencies.

1 reaction
thomwolf commented, Nov 23, 2020

We were brainstorming around your use case.

Let’s keep the issue open for now, I think this is an interesting question to think about.


Top Results From Across the Web

  • Load S3 Data into AWS SageMaker Notebook - Stack Overflow
  • How to load data from S3 to AWS SageMaker
  • How To Load Data From AWS S3 Into Sagemaker (Using ...)
  • Import - Amazon SageMaker - AWS Documentation
  • Upload the data to S3 - Amazon Sagemaker Workshop
