[BUG] Reading data from GCS creates issue
Describe the bug
Reading a parquet file from Google Cloud Storage does not work.
Steps/Code to reproduce bug
import nvtabular as nvt

dataset = nvt.Dataset("gs://bucket/file.parquet")
dataset.to_ddf().head()
Error:
cuDF failure at: ../src/table/table.cpp:42: Column size mismatch:
If the data is copied to local disk, the code works. cuDF / dask_cudf can read the same file from GCS directly. This is with the latest NVTabular.
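For reference, a minimal sketch of both behaviours described above, assuming gcsfs is installed and using placeholder bucket and local paths:

import fsspec
import dask_cudf
import nvtabular as nvt

# Reading the same file directly with dask_cudf works, so GCS access itself is fine
ddf = dask_cudf.read_parquet("gs://bucket/file.parquet")
print(ddf.head())

# Workaround: stage the file on local disk first, then hand the local copy to NVTabular
with fsspec.open("gs://bucket/file.parquet", "rb") as remote, open("/tmp/file.parquet", "wb") as local:
    local.write(remote.read())

dataset = nvt.Dataset("/tmp/file.parquet")
print(dataset.to_ddf().head())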
Top GitHub Comments
I’m sorry for the delay here. This is indeed a bug in the optimized data-transfer logic for read_parquet from remote storage. It turns out that the list column name is modified from “genres” to “genres.list.element” in the parquet metadata, and so we fail to transfer the data for that column. In the near future, all of this logic will live directly in fsspec (and will be removed from NVTabular), but I will submit a temporary fix for NVT ASAP.
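As an illustration of the naming mismatch, here is a hedged sketch using pyarrow on a local copy of the transformed file (the file path is a placeholder):

import pyarrow.parquet as pq

pf = pq.ParquetFile("part_0.parquet")  # placeholder path for the transformed parquet file

# Arrow-level (logical) column names -- the list column is simply 'genres'
print(pf.schema_arrow.names)

# Leaf-column paths recorded in the parquet footer -- the same column appears
# as 'genres.list.element', which is the name the transfer logic fails to match
rg = pf.metadata.row_group(0)
print([rg.column(i).path_in_schema for i in range(rg.num_columns)])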
@rjzamora sorry it took a while. It was a bit tricky to reproduce this on a test dataset. But if you copy the transformed parquet from this example (Cell 18) to a GCS bucket and then read it back as in the snippet above, it will give the same error as above.
A couple of observations: (1) this seems to happen only when list columns exist; and (2) it happens only for sufficiently large datasets (when I sliced the problematic dataset down, it seemed to work fine). Hope this helps reproduce the error on your end. Thanks
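For anyone trying to reproduce this, a rough sketch of the steps described above (the bucket name and the local path of the transformed file are placeholders):

import gcsfs
import nvtabular as nvt

# Upload the transformed parquet (Cell 18 of the example) to a GCS bucket
fs = gcsfs.GCSFileSystem()
fs.put("train/part_0.parquet", "gs://my-bucket/part_0.parquet")

# Reading it back through NVTabular triggers the "Column size mismatch" failure
nvt.Dataset("gs://my-bucket/part_0.parquet").to_ddf().head()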