
[BUG] Reading data from GCS creates issue


Describe the bug

Reading a parquet file from Google Cloud Storage does not work.

Steps/Code to reproduce bug

dataset = nvt.Dataset("gs://bucket/file.parquet")
dataset.to_ddf().head()

Error:

cuDF failure at: ../src/table/table.cpp:42: Column size mismatch:

If the data is copied to local disk, the code works. cuDF / dask_cudf can read the same file from GCS directly. This is with the latest NVTabular.
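The workaround described above (copy the remote file to local disk, then read it) can be sketched with fsspec, the filesystem layer NVTabular uses for remote paths. This is a minimal illustration, not the reporter's exact code: it uses fsspec's in-memory filesystem as a stand-in for GCS, since a real gs:// path would need gcsfs and credentials.

```python
import os
import tempfile

import fsspec

# Stand-in for a remote object store. In real use this would be
# fsspec.filesystem("gs") (requires gcsfs and valid GCP credentials).
fs = fsspec.filesystem("memory")
fs.pipe("/bucket/file.parquet", b"parquet-bytes")  # pretend remote file

# Workaround: copy the remote file to local disk first, then point
# nvt.Dataset at the local path instead of the gs:// URL.
local_path = os.path.join(tempfile.mkdtemp(), "file.parquet")
fs.get("/bucket/file.parquet", local_path)

with open(local_path, "rb") as f:
    data = f.read()
print(len(data))
```

For the real bucket, swap `fsspec.filesystem("memory")` for `fsspec.filesystem("gs")` and `/bucket/file.parquet` for the actual object path.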

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 7

Top GitHub Comments

2 reactions
rjzamora commented, Oct 26, 2021

I’m sorry for the delay here. This is indeed a bug in the optimized data-transfer logic for read_parquet from remote storage. It turns out that the list column name is modified from “genres” to “genres.list.element” in the parquet metadata, so we fail to transfer the data for that column. In the near future, all of this logic will live directly in fsspec (and will be removed from NVTabular), but I will submit a temporary fix for NVT ASAP.

0 reactions
pchandar commented, Oct 14, 2021

@rjzamora sorry it took a while. It was a bit tricky to reproduce this on a test dataset. But if you copy the transformed parquet from this (Cell 18) example to a GCS bucket and then run

ds = nvt.Dataset("gs://bucket/movielens.parquet")
ds.head()

you will get the following error:

RuntimeError: cuDF failure at: ../src/table/table.cpp:42: Column size mismatch: 76 != 20000076

A couple of observations: (1) this seems to happen only when list columns exist; and (2) only for sufficiently large datasets (when I sliced the problematic dataset, it seemed to work fine). Hope this helps reproduce the error at your end. Thanks!
