[BUG] Reading Dataset from S3 Results in Error
Describe the bug
Hello! I was experimenting with reading a parquet dataset directly from S3 and ran into a problem.
Steps/Code to reproduce bug
import pandas as pd
import nvtabular as nvt

s3_save_path = "s3://my-s3-bucket/my_folder/nvt_test.parquet"

num_rows = 30000000

df = pd.DataFrame({
    "cat_col": [f"val_{i}" for i in range(num_rows)],
    "int_col": list(range(num_rows)),
})

df.to_parquet(
    s3_save_path
)
nvt.Dataset(
    s3_save_path,
    engine="parquet",
)
yields the error
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [1], in <cell line: 16>()
8 df = pd.DataFrame({
9 "cat_col": [f"val_{i}" for i in range(num_rows)],
10 "int_col": list(range(num_rows)),
11 })
13 df.to_parquet(
14 s3_save_path
15 )
---> 16 nvt.Dataset(
17 s3_save_path,
18 engine="parquet",
19 )
File /core/merlin/io/dataset.py:304, in Dataset.__init__(self, path_or_source, engine, npartitions, part_size, part_mem_fraction, storage_options, dtypes, client, cpu, base_dataset, schema, **kwargs)
302 if isinstance(engine, str):
303 if engine == "parquet":
--> 304 self.engine = ParquetDatasetEngine(
305 paths, part_size, storage_options=storage_options, cpu=self.cpu, **kwargs
306 )
307 elif engine == "csv":
308 self.engine = CSVDatasetEngine(
309 paths, part_size, storage_options=storage_options, cpu=self.cpu, **kwargs
310 )
File /core/merlin/io/parquet.py:312, in ParquetDatasetEngine.__init__(self, paths, part_size, storage_options, row_groups_per_part, legacy, batch_size, cpu, **kwargs)
309 self.dataset_kwargs = self.read_parquet_kwargs.pop("dataset", {})
311 if row_groups_per_part is None:
--> 312 self._real_meta, rg_byte_size_0 = run_on_worker(
313 _sample_row_group,
314 self._path0,
315 self.fs,
316 cpu=self.cpu,
317 memory_usage=True,
318 **self.read_parquet_kwargs,
319 )
320 row_groups_per_part = self.part_size / rg_byte_size_0
321 if row_groups_per_part < 1.0:
File /core/merlin/core/utils.py:488, in run_on_worker(func, *args, **kwargs)
486 return dask.delayed(func)(*args, **kwargs).compute()
487 # No Dask client - Use simple function call
--> 488 return func(*args, **kwargs)
File /core/merlin/io/parquet.py:1215, in _sample_row_group(path, fs, cpu, n, memory_usage, **kwargs)
1213 _df = cudf.io.read_parquet(path, row_groups=0, **kwargs)
1214 else:
-> 1215 _df = _optimized_read_remote(path, 0, None, fs, **kwargs)
1216 _indices = list(range(n))
1217 if memory_usage:
File /core/merlin/io/fsspec_utils.py:158, in _optimized_read_remote(path, row_groups, columns, fs, **kwargs)
149 else:
150 # Get byte-ranges that are known to contain the
151 # required data for this read
152 byte_ranges, footer, file_size = _get_parquet_byte_ranges(
153 path, row_groups, columns, fs, **user_kwargs
154 )
156 return cudf.read_parquet(
157 # Transfer the required bytes with fsspec
--> 158 io.BytesIO(
159 _fsspec_data_transfer(
160 path,
161 fs,
162 byte_ranges=byte_ranges,
163 footer=footer,
164 file_size=file_size,
165 add_par1_magic=True,
166 **user_kwargs,
167 )
168 ),
169 engine="cudf",
170 columns=columns,
171 row_groups=row_groups,
172 strings_to_categorical=strings_to_cats,
173 **read_kwargs,
174 )
TypeError: a bytes-like object is required, not '_io.BytesIO'
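For what it's worth, the final TypeError appears to come from io.BytesIO being handed another BytesIO object rather than raw bytes (io.BytesIO only accepts a bytes-like initial value), which suggests _fsspec_data_transfer is returning a buffer instead of bytes for this file. A minimal illustration of the same error, independent of NVTabular:

import io

# io.BytesIO expects a bytes-like object as its initial value.
buf = io.BytesIO(b"PAR1")   # stands in for whatever _fsspec_data_transfer returned
io.BytesIO(buf)             # TypeError: a bytes-like object is required, not '_io.BytesIO'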
Funnily enough, if I reduce num_rows to 28000000, the snippet above succeeds, so I assume this is related to the file size. I have tried playing around with the part_size and part_mem_fraction parameters but saw no change.
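For concreteness, these are the kinds of variations I tried (the specific values here are just examples); neither parameter made a difference:

import nvtabular as nvt

s3_save_path = "s3://my-s3-bucket/my_folder/nvt_test.parquet"

# Example values only - neither variation changed the error.
nvt.Dataset(s3_save_path, engine="parquet", part_size="128MB")
nvt.Dataset(s3_save_path, engine="parquet", part_mem_fraction=0.1)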
Expected behavior
nvt.Dataset should be able to read entire datasets from S3.
Environment details (please complete the following information):
I am using a Docker image built from merlin-training:22.04, so this is NVT version 1.0.0. I am running this in a Jupyter notebook on a single V100 GPU instance.
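For completeness, the versions of the relevant I/O packages inside the container can be captured with something like the following (assuming each module exposes __version__, which is the usual convention):

import cudf
import fsspec
import nvtabular
import s3fs

# Record the I/O stack versions shipped in the merlin-training:22.04 image.
for mod in (nvtabular, cudf, fsspec, s3fs):
    print(mod.__name__, getattr(mod, "__version__", "unknown"))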
Top GitHub Comments
My apologies - upon a second look it looks like s3fs’s version is not what it should be. I will clean this up a little more and get back to you!
Ah yep, looks like doing pip install git+https://github.com/fsspec/s3fs.git@2022.3.0 fixes the issue! Thank you for that pointer!
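For anyone hitting the same error, a quick sketch of how the fix can be confirmed after upgrading s3fs (to_ddf() materializes the Dataset as a dask DataFrame; the bucket path is the one from the original snippet):

import s3fs
import nvtabular as nvt

print(s3fs.__version__)  # expect 2022.3.0 after the upgrade

# Re-run the originally failing call and pull a few rows to confirm the S3 read works.
ds = nvt.Dataset(
    "s3://my-s3-bucket/my_folder/nvt_test.parquet",
    engine="parquet",
)
print(ds.to_ddf().head())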