
[BUG] Reading data from GCS creates issue


Describe the bug

Reading a parquet file from Google Cloud Storage does not work.

Steps/Code to reproduce bug

dataset = nvt.Dataset("gs://bucket/file.parquet")
dataset.to_ddf().head()

Error:

cuDF failure at: ../src/table/table.cpp:42: Column size mismatch:

If the data is copied to local disk, the code works. cuDF / dask_cudf can read the same file from GCS directly. This is with the latest NVTabular.
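The workaround described above (copy the remote file to local disk, then read it) can be sketched with fsspec, the filesystem layer NVTabular uses for remote paths. This is a minimal illustration, not the reporter's exact code: it uses fsspec's in-memory filesystem as a stand-in for GCS, since a real gs:// path would need gcsfs and credentials.

```python
import os
import tempfile

import fsspec

# Stand-in for a remote object store. In real use this would be
# fsspec.filesystem("gs") (requires gcsfs and valid GCP credentials).
fs = fsspec.filesystem("memory")
fs.pipe("/bucket/file.parquet", b"parquet-bytes")  # pretend remote file

# Workaround: copy the remote file to local disk first, then point
# nvt.Dataset at the local path instead of the gs:// URL.
local_path = os.path.join(tempfile.mkdtemp(), "file.parquet")
fs.get("/bucket/file.parquet", local_path)

with open(local_path, "rb") as f:
    data = f.read()
print(len(data))
```

For the real bucket, swap `fsspec.filesystem("memory")` for `fsspec.filesystem("gs")` and `/bucket/file.parquet` for the actual object path.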

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 7

Top GitHub Comments

2 reactions
rjzamora commented, Oct 26, 2021

I’m sorry for the delay here. This is indeed a bug in the optimized data-transfer logic for read_parquet from remote storage. It turns out that the list column name is modified from “genres” to “genres.list.element” in the parquet metadata, so we fail to transfer the data for that column. In the near future, all of this logic will live directly in fsspec (and will be removed from NVTabular), but I will submit a temporary fix for NVT ASAP.

0 reactions
pchandar commented, Oct 14, 2021

@rjzamora sorry it took a while. It was a bit tricky to reproduce this on a test dataset. But if you copy the transformed parquet from this (Cell 18) example to a GCS bucket and then run

ds = nvt.Dataset("gs://bucket/movielens.parquet")
ds.head()

you will get the following error:

RuntimeError: cuDF failure at: ../src/table/table.cpp:42: Column size mismatch: 76 != 20000076

A couple of observations: (1) this seems to happen only when list columns exist; and (2) only for sufficiently large datasets (when I sliced the problematic dataset, it seemed to work fine). Hope this helps reproduce the error at your end. Thanks!
