Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

[BUG] Reading Dataset from S3 Results in Error

See original GitHub issue

Describe the bug
Hello! I was experimenting with reading a parquet dataset directly from S3 and ran into a problem.

Steps/Code to reproduce bug

import pandas as pd 
import nvtabular as nvt 

s3_save_path = "s3://my-s3-bucket/my_folder/nvt_test.parquet"
num_rows = 30000000


df = pd.DataFrame({
    "cat_col": [f"val_{i}" for i in range(num_rows)],
    "int_col": list(range(num_rows)),
})

df.to_parquet(
    s3_save_path
)
nvt.Dataset(
    s3_save_path,
    engine="parquet",
)

yields the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [1], in <cell line: 16>()
      8 df = pd.DataFrame({
      9     "cat_col": [f"val_{i}" for i in range(num_rows)],
     10     "int_col": list(range(num_rows)),
     11 })
     13 df.to_parquet(
     14     s3_save_path
     15 )
---> 16 nvt.Dataset(
     17     s3_save_path,
     18     engine="parquet",
     19 )

File /core/merlin/io/dataset.py:304, in Dataset.__init__(self, path_or_source, engine, npartitions, part_size, part_mem_fraction, storage_options, dtypes, client, cpu, base_dataset, schema, **kwargs)
    302 if isinstance(engine, str):
    303     if engine == "parquet":
--> 304         self.engine = ParquetDatasetEngine(
    305             paths, part_size, storage_options=storage_options, cpu=self.cpu, **kwargs
    306         )
    307     elif engine == "csv":
    308         self.engine = CSVDatasetEngine(
    309             paths, part_size, storage_options=storage_options, cpu=self.cpu, **kwargs
    310         )

File /core/merlin/io/parquet.py:312, in ParquetDatasetEngine.__init__(self, paths, part_size, storage_options, row_groups_per_part, legacy, batch_size, cpu, **kwargs)
    309 self.dataset_kwargs = self.read_parquet_kwargs.pop("dataset", {})
    311 if row_groups_per_part is None:
--> 312     self._real_meta, rg_byte_size_0 = run_on_worker(
    313         _sample_row_group,
    314         self._path0,
    315         self.fs,
    316         cpu=self.cpu,
    317         memory_usage=True,
    318         **self.read_parquet_kwargs,
    319     )
    320     row_groups_per_part = self.part_size / rg_byte_size_0
    321     if row_groups_per_part < 1.0:

File /core/merlin/core/utils.py:488, in run_on_worker(func, *args, **kwargs)
    486     return dask.delayed(func)(*args, **kwargs).compute()
    487 # No Dask client - Use simple function call
--> 488 return func(*args, **kwargs)

File /core/merlin/io/parquet.py:1215, in _sample_row_group(path, fs, cpu, n, memory_usage, **kwargs)
   1213         _df = cudf.io.read_parquet(path, row_groups=0, **kwargs)
   1214     else:
-> 1215         _df = _optimized_read_remote(path, 0, None, fs, **kwargs)
   1216 _indices = list(range(n))
   1217 if memory_usage:

File /core/merlin/io/fsspec_utils.py:158, in _optimized_read_remote(path, row_groups, columns, fs, **kwargs)
    149 else:
    150     # Get byte-ranges that are known to contain the
    151     # required data for this read
    152     byte_ranges, footer, file_size = _get_parquet_byte_ranges(
    153         path, row_groups, columns, fs, **user_kwargs
    154     )
    156     return cudf.read_parquet(
    157         # Transfer the required bytes with fsspec
--> 158         io.BytesIO(
    159             _fsspec_data_transfer(
    160                 path,
    161                 fs,
    162                 byte_ranges=byte_ranges,
    163                 footer=footer,
    164                 file_size=file_size,
    165                 add_par1_magic=True,
    166                 **user_kwargs,
    167             )
    168         ),
    169         engine="cudf",
    170         columns=columns,
    171         row_groups=row_groups,
    172         strings_to_categorical=strings_to_cats,
    173         **read_kwargs,
    174     )

TypeError: a bytes-like object is required, not '_io.BytesIO'
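
The final TypeError here means io.BytesIO was given an object that is already a BytesIO rather than raw bytes, which suggests _fsspec_data_transfer returned a BytesIO in this code path. The error can be reproduced in isolation with a minimal sketch:

import io

# io.BytesIO expects a bytes-like initial value; wrapping one BytesIO in
# another raises the same TypeError seen in the traceback above.
buf = io.BytesIO(b"some parquet bytes")
io.BytesIO(buf)  # TypeError: a bytes-like object is required, not '_io.BytesIO'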

Funnily enough, if I reduce num_rows to 28000000, the snippet above succeeds, so I assume this is related to the file size. I have tried playing around with the part_size and part_mem_fraction parameters but saw no change.
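
For reference, both parameters are accepted directly by the nvt.Dataset constructor (they appear in the signature shown in the traceback above). A minimal sketch with illustrative values, reusing the s3_save_path from the reproduction code; these are not the exact values tried in the report:

import nvtabular as nvt

# Illustrative settings only; the report does not state which values were tried.
nvt.Dataset(
    s3_save_path,
    engine="parquet",
    part_size="128MB",        # explicit partition size
)
nvt.Dataset(
    s3_save_path,
    engine="parquet",
    part_mem_fraction=0.05,   # fraction of device memory to use per partition
)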

Expected behavior
nvt.Dataset should be able to read entire datasets from S3.

Environment details
Using a Docker image built from merlin-training:22.04, so this is NVT version 1.0.0. I am running this in a Jupyter notebook on a single V100 GPU instance.

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 9 (2 by maintainers)

Top GitHub Comments

1 reaction
shoyasaxa commented, May 3, 2022

My apologies - upon a second look, it seems s3fs’s version is not what it should be. I will clean this up a little more and get back to you!

1 reaction
shoyasaxa commented, Apr 27, 2022

Ah yep looks like doing pip install git+https://github.com/fsspec/s3fs.git@2022.3.0 fixes the issue! Thank you for that pointer!
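
For anyone landing here with the same error, a quick way to check whether the installed fsspec and s3fs versions line up (a minimal sketch; package names as published on PyPI):

import fsspec
import s3fs

# The comments above point to an s3fs/fsspec version mismatch as the cause;
# printing both versions makes the mismatch easy to spot.
print("fsspec:", fsspec.__version__)
print("s3fs:", s3fs.__version__)

If the two differ, upgrading both to the same release (2022.3.0 in this case, as in the comment above) resolved the error.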

Read more comments on GitHub >

Top Results From Across the Web

Resolve errors uploading data to or downloading data ... - AWS
First, follow the steps below to run the SELECT INTO OUTFILE S3 or LOAD DATA FROM S3 commands using Amazon Aurora. If you...
Read more >
[BUG] Error when trying to use AWS S3 as dataset #486 - GitHub
I start running the program using the S3 storage, and it is seems like it is reading data but after a while (10...
Read more >
Overflowerror when reading from s3 - signed integer is greater ...
This example uses a memoryview to store the final results to avoid building up a byte array as the data is read, which...
Read more >
Step 6. Resolve Data Load Errors Related to Data Issues
The following process returns errors by query ID and saves the results to a table for future reference. You can view the query...
Read more >
uploading to S3 fatal error: Parameter validation failed | Medium
Let's say you get this error trying to upload to an S3 bucket to a bucket you know already exists and you're sure...
Read more >
