[BUG] "Schemas are inconsistent" error for parquet files which have the same dtype, but are different in not null setting
Describe the bug
This bug did not occur with NVT 0.2, but occurs with the main branch (the future NVT 0.3). The error "Schemas are inconsistent" is raised when the parquet files in the same folder share the same columns and dtypes, but some column contains null values (<NA>) in one of the parquet files and not in the corresponding column of another file. The error is raised when an NVT dataset is instantiated and we call head() on its first elements:
```python
import nvtabular as nvt

ds = nvt.Dataset(PATH, engine="parquet", part_size="1000MB")
ds.to_ddf().head()
```
The error raised then is:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/dataframe/io/parquet/arrow.py in _append_row_groups(metadata, md)
33 try:
---> 34 metadata.append_row_groups(md)
35 except RuntimeError as err:
/opt/conda/envs/rapids/lib/python3.7/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.FileMetaData.append_row_groups()
RuntimeError: AppendRowGroups requires equal schemas.
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
<ipython-input-4-b3788a1214d6> in <module>
----> 1 ds.to_ddf().head()
/nvtabular0.3/NVTabular/nvtabular/io/dataset.py in to_ddf(self, columns, shuffle, seed)
263 """
264 # Use DatasetEngine to create ddf
--> 265 ddf = self.engine.to_ddf(columns=columns)
266
267 # Shuffle the partitions of ddf (optional)
/nvtabular0.3/NVTabular/nvtabular/io/parquet.py in to_ddf(self, columns)
103 gather_statistics=False,
104 split_row_groups=self.row_groups_per_part,
--> 105 storage_options=self.storage_options,
106 )
107
/opt/conda/envs/rapids/lib/python3.7/site-packages/dask_cudf/io/parquet.py in read_parquet(path, columns, split_row_groups, row_groups_per_part, **kwargs)
192 split_row_groups=split_row_groups,
193 engine=CudfEngine,
--> 194 **kwargs,
195 )
196
/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/dataframe/io/parquet/core.py in read_parquet(path, columns, filters, categories, index, storage_options, engine, gather_statistics, split_row_groups, chunksize, **kwargs)
237 filters=filters,
238 split_row_groups=split_row_groups,
--> 239 **kwargs,
240 )
241
/opt/conda/envs/rapids/lib/python3.7/site-packages/dask_cudf/io/parquet.py in read_metadata(*args, **kwargs)
15 @staticmethod
16 def read_metadata(*args, **kwargs):
---> 17 meta, stats, parts, index = ArrowEngine.read_metadata(*args, **kwargs)
18
19 # If `strings_to_categorical==True`, convert objects to int32
/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/dataframe/io/parquet/arrow.py in read_metadata(cls, fs, paths, categories, index, gather_statistics, filters, split_row_groups, **kwargs)
654 gather_statistics,
655 ) = _gather_metadata(
--> 656 paths, fs, split_row_groups, gather_statistics, filters, dataset_kwargs
657 )
658
/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/dataframe/io/parquet/arrow.py in _gather_metadata(paths, fs, split_row_groups, gather_statistics, filters, dataset_kwargs)
246 md.set_file_path(fn)
247 if metadata:
--> 248 _append_row_groups(metadata, md)
249 else:
250 metadata = md
/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/dataframe/io/parquet/arrow.py in _append_row_groups(metadata, md)
40 "pyarrow schema. Such as "
41 '`to_parquet(..., schema={"column1": pa.string()})`'
---> 42 ) from err
43 else:
44 raise err
RuntimeError: Schemas are inconsistent, try using `to_parquet(..., schema="infer")`, or pass an explicit pyarrow schema. Such as `to_parquet(..., schema={"column1": pa.string()})`
Using this script from @rjzamora, it was possible to verify that the metadata of the parquet files differs: the columns are marked not null in one file and nullable in the other (the one that contains nulls).
ValueError: Schema in /gfn-merlin/gmoreira/data/debug/gfn_problematic_columns//2020-08-30.parquet was different.
DayOfWeekUTC: string not null
-- field metadata --
PARQUET:field_id: '1'
MonthUTC: string not null
-- field metadata --
PARQUET:field_id: '2'
HourOfDayUTC: float not null
-- field metadata --
PARQUET:field_id: '3'
WeekNumberUTC: float not null
-- field metadata --
PARQUET:field_id: '4'
-- schema metadata --
pandas: '{"index_columns": [], "column_indexes": [{"name": null, "field_n' + 685
vs
DayOfWeekUTC: string
-- field metadata --
PARQUET:field_id: '1'
MonthUTC: string
-- field metadata --
PARQUET:field_id: '2'
HourOfDayUTC: float
-- field metadata --
PARQUET:field_id: '3'
WeekNumberUTC: float
-- field metadata --
PARQUET:field_id: '4'
-- schema metadata --
pandas: '{"index_columns": [], "column_indexes": [{"name": null, "field_n' + 685
Incidentally, the two parquet files can be loaded individually using dask_cudf. But when they are loaded together (e.g., by pointing to a directory containing both files)
```python
df = dask_cudf.read_parquet(PATH).compute()
```
the following error is raised by dask_cudf:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-8-a4ba9b6fbe05> in <module>
----> 1 df2 = df.compute()
/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/base.py in compute(self, **kwargs)
165 dask.base.compute
166 """
--> 167 (result,) = compute(self, traverse=False, **kwargs)
168 return result
169
/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/base.py in compute(*args, **kwargs)
450 postcomputes.append(x.__dask_postcompute__())
451
--> 452 results = schedule(dsk, keys, **kwargs)
453 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
454
/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/local.py in get_sync(dsk, keys, **kwargs)
525 """
526 kwargs.pop("num_workers", None) # if num_workers present, remove it
--> 527 return get_async(apply_sync, 1, dsk, keys, **kwargs)
528
529
/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/local.py in get_async(apply_async, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, **kwargs)
492
493 while state["ready"] and len(state["running"]) < num_workers:
--> 494 fire_task()
495
496 succeeded = True
/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/local.py in fire_task()
464 pack_exception,
465 ),
--> 466 callback=queue.put,
467 )
468
/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/local.py in apply_sync(func, args, kwds, callback)
514 def apply_sync(func, args=(), kwds={}, callback=None):
515 """ A naive synchronous version of apply_async """
--> 516 res = func(*args, **kwds)
517 if callback is not None:
518 callback(res)
/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/local.py in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
225 failed = False
226 except BaseException as e:
--> 227 result = pack_exception(e, dumps)
228 failed = True
229 return key, result, failed
/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/local.py in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
220 try:
221 task, data = loads(task_info)
--> 222 result = _execute_task(task, data)
223 id = get_id()
224 result = dumps((result, id))
/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/core.py in _execute_task(arg, cache, dsk)
119 # temporaries by their reference count and can execute certain
120 # operations in-place.
--> 121 return func(*(_execute_task(a, cache) for a in args))
122 elif not ishashable(arg):
123 return arg
/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/dataframe/io/parquet/core.py in read_parquet_part(func, fs, meta, part, columns, index, kwargs)
274 This function is used by `read_parquet`."""
275 if isinstance(part, list):
--> 276 dfs = [func(fs, rg, columns.copy(), index, **kwargs) for rg in part]
277 df = concat(dfs, axis=0)
278 else:
/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/dataframe/io/parquet/core.py in <listcomp>(.0)
274 This function is used by `read_parquet`."""
275 if isinstance(part, list):
--> 276 dfs = [func(fs, rg, columns.copy(), index, **kwargs) for rg in part]
277 df = concat(dfs, axis=0)
278 else:
/opt/conda/envs/rapids/lib/python3.7/site-packages/dask_cudf/io/parquet.py in read_partition(fs, piece, columns, index, categories, partitions, **kwargs)
55 row_groups=row_group,
56 strings_to_categorical=strings_to_cats,
---> 57 **kwargs.get("read", {}),
58 )
59 else:
/opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/io/parquet.py in read_parquet(filepath_or_buffer, engine, columns, filters, row_groups, skiprows, num_rows, strings_to_categorical, use_pandas_metadata, *args, **kwargs)
248 num_rows=num_rows,
249 strings_to_categorical=strings_to_categorical,
--> 250 use_pandas_metadata=use_pandas_metadata,
251 )
252 else:
cudf/_lib/parquet.pyx in cudf._lib.parquet.read_parquet()
cudf/_lib/parquet.pyx in cudf._lib.parquet.read_parquet()
**RuntimeError: cuDF failure at: /opt/conda/envs/rapids/conda-bld/libcudf_1603354682602/work/cpp/src/io/parquet/reader_impl.cu:229: Corrupted header or footer**
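To confirm the per-file claim above, a quick sanity-check sketch (`PATH` is again a placeholder for the directory with the two files):

```python
import glob
import os

import dask_cudf

# Each file reads cleanly in isolation; only the combined read fails.
for path in sorted(glob.glob(os.path.join(PATH, "*.parquet"))):
    df = dask_cudf.read_parquet(path).compute()
    print(path, len(df))
```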
Steps/Code to reproduce bug
A folder with a minimalist notebook and two small parquet files that reproduce the issue is available (internal access only for the NVT team).
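Since that folder is internal, here is a self-contained sketch that produces the same schema mismatch with plain pyarrow. The file names and the explicit non-nullable schema are assumptions for illustration, not the exact behavior of the writer that produced the original files:

```python
import pyarrow as pa
import pyarrow.parquet as pq

arr = pa.array(["a", "b"], type=pa.string())

# File 1: column declared "not null" (required), as a writer may do when a
# column happens to contain no nulls.
required = pa.schema([pa.field("x", pa.string(), nullable=False)])
pq.write_table(pa.Table.from_arrays([arr], schema=required), "file1.parquet")

# File 2: same column and dtype, but nullable (pyarrow's default).
pq.write_table(pa.Table.from_arrays([arr], names=["x"]), "file2.parquet")

# Appending the row-group metadata, as dask does when scanning a directory,
# fails exactly like the traceback above.
md = pq.read_metadata("file1.parquet")
md.append_row_groups(pq.read_metadata("file2.parquet"))
# RuntimeError: AppendRowGroups requires equal schemas.
```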
Expected behavior
NVT should be able to load a dataset whose parquet files share the same dtypes, even if the columns are marked not null in some files and nullable in others.
Environment details
- nvtabular: main branch (future 0.3)
- cudf==0.16
- dask_cudf==0.16
- pyarrow==1.0.1
Additional context
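One possible workaround until this is fixed, sketched under the assumption that rewriting the files is acceptable (`PATH` and `OUT_PATH` are placeholders): normalize every file to an all-nullable schema, in the spirit of the error message's own suggestion of passing `schema="infer"` or an explicit pyarrow schema to `to_parquet` when the files are written.

```python
import glob
import os

import pyarrow as pa
import pyarrow.parquet as pq

# Rewrite each file with every field forced to nullable, so all files in the
# directory end up with identical schemas. The pandas metadata is preserved.
for path in sorted(glob.glob(os.path.join(PATH, "*.parquet"))):
    table = pq.read_table(path)
    nullable = pa.schema(
        [f.with_nullable(True) for f in table.schema],
        metadata=table.schema.metadata,
    )
    fixed = pa.Table.from_arrays(table.columns, schema=nullable)
    pq.write_table(fixed, os.path.join(OUT_PATH, os.path.basename(path)))
```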
Top GitHub Comments
+1 I’ve seen this issue many times, but only when loading NVTabular’s parquet output. If we have tools to inspect and ensure data sanity (like the script from @rjzamora), it would be good to document and share them in a “best practices” document somewhere.
@gabrielspmoreira @rjzamora - closing this; re-open if there is still an issue.