Loading Data From an S3 Path in SageMaker
In SageMaker I'm trying to load the dataset from an S3 path as follows:
```python
train_path = "s3://xxxxxxxxxx/xxxxxxxxxx/train.csv"
valid_path = "s3://xxxxxxxxxx/xxxxxxxxxx/validation.csv"
test_path = "s3://xxxxxxxxxx/xxxxxxxxxx/test.csv"

data_files = {}
data_files["train"] = train_path
data_files["validation"] = valid_path
data_files["test"] = test_path
extension = train_path.split(".")[-1]
datasets = load_dataset(extension, data_files=data_files, s3_enabled=True)
print(datasets)
```
I'm getting the following error:
```
algo-1-7plil_1 |   File "main.py", line 21, in <module>
algo-1-7plil_1 |     datasets = load_dataset(extension, data_files=data_files)
algo-1-7plil_1 |   File "/opt/conda/lib/python3.6/site-packages/datasets/load.py", line 603, in load_dataset
algo-1-7plil_1 |     **config_kwargs,
algo-1-7plil_1 |   File "/opt/conda/lib/python3.6/site-packages/datasets/builder.py", line 155, in __init__
algo-1-7plil_1 |     **config_kwargs,
algo-1-7plil_1 |   File "/opt/conda/lib/python3.6/site-packages/datasets/builder.py", line 305, in _create_builder_config
algo-1-7plil_1 |     m.update(str(os.path.getmtime(data_file)))
algo-1-7plil_1 |   File "/opt/conda/lib/python3.6/genericpath.py", line 55, in getmtime
algo-1-7plil_1 |     return os.stat(filename).st_mtime
algo-1-7plil_1 | FileNotFoundError: [Errno 2] No such file or directory: 's3://lsmv-sagemaker/pubmedbert/test.csv'
```
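The bottom frames of the traceback show the root cause: when building the config fingerprint, the library calls `os.path.getmtime` on each data file, and that call treats the `s3://` URI as a local filesystem path. A minimal reproduction of just that failing call (the bucket name here is a placeholder):

```python
import os

# os.path.getmtime stats the path on the local filesystem, so an
# s3:// URI (which is not a local file) raises FileNotFoundError.
try:
    os.path.getmtime("s3://my-bucket/test.csv")
except FileNotFoundError as err:
    print(f"FileNotFoundError: {err}")
```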
However, when I try the same paths with pandas, it is able to load them from S3.
Does the datasets library support loading from S3 paths?
Issue Analytics
- Created: 3 years ago
- Comments: 16 (7 by maintainers)
@dorlavie are you using SageMaker for training too? Then you could use an S3 URI, for example `s3://my-bucket/my-training-data`, and pass it within the `.fit()` function when you start the SageMaker training job. SageMaker would then download the data from S3 into the training runtime, and you could load it from disk in the `train.py` script.
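Inside the training job, SageMaker downloads each `.fit()` channel to `/opt/ml/input/data/<channel>` and exposes that directory via an `SM_CHANNEL_<NAME>` environment variable. A sketch of the `train.py` side, assuming channels named train/validation/test and a `<split>.csv` file inside each channel directory (those file names are assumptions; they depend on what was uploaded):

```python
import os

# SageMaker sets SM_CHANNEL_TRAIN, SM_CHANNEL_VALIDATION, etc. to the
# local directories it downloaded each .fit() channel into.
def build_data_files():
    data_files = {}
    for split in ("train", "validation", "test"):
        channel_dir = os.environ.get(f"SM_CHANNEL_{split.upper()}")
        if channel_dir:
            # assumption: each channel contains a file named <split>.csv
            data_files[split] = os.path.join(channel_dir, f"{split}.csv")
    return data_files

# Inside the job these are real local paths, e.g.
# /opt/ml/input/data/train/train.csv, so they can be passed to
# load_dataset("csv", data_files=build_data_files()) without S3 support.
print(build_data_files())
```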
I have created an example of how to use transformers and datasets with SageMaker: https://github.com/philschmid/huggingface-sagemaker-example/tree/main/03_huggingface_sagemaker_trainer_with_data_from_s3

The example contains a Jupyter notebook `sagemaker-example.ipynb` and a `src/` folder. The notebook is used to create the training job on AWS SageMaker. The `src/` folder contains `train.py`, our training script, and `requirements.txt` for additional dependencies.

We were brainstorming around your use case.
Let’s keep the issue open for now, I think this is an interesting question to think about.