Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

[BUG] Reading Dataset from S3 Results in Error

See original GitHub issue

Describe the bug
Hello! I was experimenting with reading a parquet dataset directly from S3 and ran into a problem.

Steps/Code to reproduce bug

import pandas as pd 
import nvtabular as nvt 

s3_save_path = "s3://my-s3-bucket/my_folder/nvt_test.parquet"
num_rows = 30000000


df = pd.DataFrame({
    "cat_col": [f"val_{i}" for i in range(num_rows)],
    "int_col": list(range(num_rows)),
})

df.to_parquet(
    s3_save_path
)
nvt.Dataset(
    s3_save_path,
    engine="parquet",
)

yields the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [1], in <cell line: 16>()
      8 df = pd.DataFrame({
      9     "cat_col": [f"val_{i}" for i in range(num_rows)],
     10     "int_col": list(range(num_rows)),
     11 })
     13 df.to_parquet(
     14     s3_save_path
     15 )
---> 16 nvt.Dataset(
     17     s3_save_path,
     18     engine="parquet",
     19 )

File /core/merlin/io/dataset.py:304, in Dataset.__init__(self, path_or_source, engine, npartitions, part_size, part_mem_fraction, storage_options, dtypes, client, cpu, base_dataset, schema, **kwargs)
    302 if isinstance(engine, str):
    303     if engine == "parquet":
--> 304         self.engine = ParquetDatasetEngine(
    305             paths, part_size, storage_options=storage_options, cpu=self.cpu, **kwargs
    306         )
    307     elif engine == "csv":
    308         self.engine = CSVDatasetEngine(
    309             paths, part_size, storage_options=storage_options, cpu=self.cpu, **kwargs
    310         )

File /core/merlin/io/parquet.py:312, in ParquetDatasetEngine.__init__(self, paths, part_size, storage_options, row_groups_per_part, legacy, batch_size, cpu, **kwargs)
    309 self.dataset_kwargs = self.read_parquet_kwargs.pop("dataset", {})
    311 if row_groups_per_part is None:
--> 312     self._real_meta, rg_byte_size_0 = run_on_worker(
    313         _sample_row_group,
    314         self._path0,
    315         self.fs,
    316         cpu=self.cpu,
    317         memory_usage=True,
    318         **self.read_parquet_kwargs,
    319     )
    320     row_groups_per_part = self.part_size / rg_byte_size_0
    321     if row_groups_per_part < 1.0:

File /core/merlin/core/utils.py:488, in run_on_worker(func, *args, **kwargs)
    486     return dask.delayed(func)(*args, **kwargs).compute()
    487 # No Dask client - Use simple function call
--> 488 return func(*args, **kwargs)

File /core/merlin/io/parquet.py:1215, in _sample_row_group(path, fs, cpu, n, memory_usage, **kwargs)
   1213         _df = cudf.io.read_parquet(path, row_groups=0, **kwargs)
   1214     else:
-> 1215         _df = _optimized_read_remote(path, 0, None, fs, **kwargs)
   1216 _indices = list(range(n))
   1217 if memory_usage:

File /core/merlin/io/fsspec_utils.py:158, in _optimized_read_remote(path, row_groups, columns, fs, **kwargs)
    149 else:
    150     # Get byte-ranges that are known to contain the
    151     # required data for this read
    152     byte_ranges, footer, file_size = _get_parquet_byte_ranges(
    153         path, row_groups, columns, fs, **user_kwargs
    154     )
    156     return cudf.read_parquet(
    157         # Transfer the required bytes with fsspec
--> 158         io.BytesIO(
    159             _fsspec_data_transfer(
    160                 path,
    161                 fs,
    162                 byte_ranges=byte_ranges,
    163                 footer=footer,
    164                 file_size=file_size,
    165                 add_par1_magic=True,
    166                 **user_kwargs,
    167             )
    168         ),
    169         engine="cudf",
    170         columns=columns,
    171         row_groups=row_groups,
    172         strings_to_categorical=strings_to_cats,
    173         **read_kwargs,
    174     )

TypeError: a bytes-like object is required, not '_io.BytesIO'
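
The final TypeError here means io.BytesIO was given an object that is already a BytesIO rather than raw bytes, which suggests _fsspec_data_transfer returned a BytesIO in this code path. The error can be reproduced in isolation with a minimal sketch:

import io

# io.BytesIO expects a bytes-like initial value; wrapping one BytesIO in
# another raises the same TypeError seen in the traceback above.
buf = io.BytesIO(b"some parquet bytes")
io.BytesIO(buf)  # TypeError: a bytes-like object is required, not '_io.BytesIO'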

Funnily enough, if I reduce num_rows to 28000000, the snippet above succeeds, so I assume this is related to the file size. I have tried playing around with the part_size and part_mem_fraction parameters but saw no change.
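
For reference, both parameters are accepted directly by the nvt.Dataset constructor (they appear in the signature shown in the traceback above). A minimal sketch with illustrative values, reusing the s3_save_path from the reproduction code; these are not the exact values tried in the report:

import nvtabular as nvt

# Illustrative settings only; the report does not state which values were tried.
nvt.Dataset(
    s3_save_path,
    engine="parquet",
    part_size="128MB",        # explicit partition size
)
nvt.Dataset(
    s3_save_path,
    engine="parquet",
    part_mem_fraction=0.05,   # fraction of device memory to use per partition
)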

Expected behavior
nvt.Dataset should be able to read entire datasets from S3.

Environment details
Using a Docker image built from merlin-training:22.04, so this is NVT version 1.0.0. I am running this in a Jupyter notebook on a single V100 GPU instance.

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 9 (2 by maintainers)

Top GitHub Comments

1 reaction
shoyasaxa commented, May 3, 2022

My apologies - upon a second look, it seems s3fs’s version is not what it should be. I will clean this up a little more and get back to you!

1 reaction
shoyasaxa commented, Apr 27, 2022

Ah yep looks like doing pip install git+https://github.com/fsspec/s3fs.git@2022.3.0 fixes the issue! Thank you for that pointer!
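
For anyone landing here with the same error, a quick way to check whether the installed fsspec and s3fs versions line up (a minimal sketch; package names as published on PyPI):

import fsspec
import s3fs

# The comments above point to an s3fs/fsspec version mismatch as the cause;
# printing both versions makes the mismatch easy to spot.
print("fsspec:", fsspec.__version__)
print("s3fs:", s3fs.__version__)

If the two differ, upgrading both to the same release (2022.3.0 in this case, as in the comment above) resolved the error.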

Read more comments on GitHub >

Top Results From Across the Web

Resolve errors uploading data to or downloading data ... - AWS
First, follow the steps below to run the SELECT INTO OUTFILE S3 or LOAD DATA FROM S3 commands using Amazon Aurora. If you...
Read more >
[BUG] Error when trying to use AWS S3 as dataset #486 - GitHub
I start running the program using the S3 storage, and it is seems like it is reading data but after a while (10...
Read more >
Overflowerror when reading from s3 - signed integer is greater ...
This example uses a memoryview to store the final results to avoid building up a byte array as the data is read, which...
Read more >
Step 6. Resolve Data Load Errors Related to Data Issues
The following process returns errors by query ID and saves the results to a table for future reference. You can view the query...
Read more >
uploading to S3 fatal error: Parameter validation failed | Medium
Let's say you get this error trying to upload to an S3 bucket to a bucket you know already exists and you're sure...
Read more >
