[BUG] "Schemas are inconsistent" error for parquet files which have the same dtype, but are different in not null setting

See original GitHub issue

Describe the bug

This bug did not occur with NVT 0.2, but it now occurs with the main branch (future NVT 0.3). The error “Schemas are inconsistent” is raised when the parquet files in the same folder share the same columns and dtypes, but some column contains null values (<NA>) in one of the parquet files while the corresponding column in another parquet file does not. The error is raised when an NVT dataset is instantiated and we try to head() its first elements, like:

ds = nvt.Dataset(PATH, engine="parquet", part_size="1000MB")
ds.to_ddf().head()

The error raised then is:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/dataframe/io/parquet/arrow.py in _append_row_groups(metadata, md)
     33     try:
---> 34         metadata.append_row_groups(md)
     35     except RuntimeError as err:

/opt/conda/envs/rapids/lib/python3.7/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.FileMetaData.append_row_groups()

RuntimeError: AppendRowGroups requires equal schemas.

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
<ipython-input-4-b3788a1214d6> in <module>
----> 1 ds.to_ddf().head()

/nvtabular0.3/NVTabular/nvtabular/io/dataset.py in to_ddf(self, columns, shuffle, seed)
    263         """
    264         # Use DatasetEngine to create ddf
--> 265         ddf = self.engine.to_ddf(columns=columns)
    266 
    267         # Shuffle the partitions of ddf (optional)

/nvtabular0.3/NVTabular/nvtabular/io/parquet.py in to_ddf(self, columns)
    103             gather_statistics=False,
    104             split_row_groups=self.row_groups_per_part,
--> 105             storage_options=self.storage_options,
    106         )
    107 

/opt/conda/envs/rapids/lib/python3.7/site-packages/dask_cudf/io/parquet.py in read_parquet(path, columns, split_row_groups, row_groups_per_part, **kwargs)
    192         split_row_groups=split_row_groups,
    193         engine=CudfEngine,
--> 194         **kwargs,
    195     )
    196 

/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/dataframe/io/parquet/core.py in read_parquet(path, columns, filters, categories, index, storage_options, engine, gather_statistics, split_row_groups, chunksize, **kwargs)
    237         filters=filters,
    238         split_row_groups=split_row_groups,
--> 239         **kwargs,
    240     )
    241 

/opt/conda/envs/rapids/lib/python3.7/site-packages/dask_cudf/io/parquet.py in read_metadata(*args, **kwargs)
     15     @staticmethod
     16     def read_metadata(*args, **kwargs):
---> 17         meta, stats, parts, index = ArrowEngine.read_metadata(*args, **kwargs)
     18 
     19         # If `strings_to_categorical==True`, convert objects to int32

/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/dataframe/io/parquet/arrow.py in read_metadata(cls, fs, paths, categories, index, gather_statistics, filters, split_row_groups, **kwargs)
    654             gather_statistics,
    655         ) = _gather_metadata(
--> 656             paths, fs, split_row_groups, gather_statistics, filters, dataset_kwargs
    657         )
    658 

/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/dataframe/io/parquet/arrow.py in _gather_metadata(paths, fs, split_row_groups, gather_statistics, filters, dataset_kwargs)
    246                 md.set_file_path(fn)
    247             if metadata:
--> 248                 _append_row_groups(metadata, md)
    249             else:
    250                 metadata = md

/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/dataframe/io/parquet/arrow.py in _append_row_groups(metadata, md)
     40                 "pyarrow schema. Such as "
     41                 '`to_parquet(..., schema={"column1": pa.string()})`'
---> 42             ) from err
     43         else:
     44             raise err

RuntimeError: Schemas are inconsistent, try using `to_parquet(..., schema="infer")`, or pass an explicit pyarrow schema. Such as `to_parquet(..., schema={"column1": pa.string()})`

By using this script from @rjzamora, it was possible to check that the metadata of the parquet files differs: the columns are marked not null in one file and nullable in the other, which is the one that contains nulls.

ValueError: Schema in /gfn-merlin/gmoreira/data/debug/gfn_problematic_columns//2020-08-30.parquet was different. 
DayOfWeekUTC: string not null
  -- field metadata --
  PARQUET:field_id: '1'
MonthUTC: string not null
  -- field metadata --
  PARQUET:field_id: '2'
HourOfDayUTC: float not null
  -- field metadata --
  PARQUET:field_id: '3'
WeekNumberUTC: float not null
  -- field metadata --
  PARQUET:field_id: '4'
-- schema metadata --
pandas: '{"index_columns": [], "column_indexes": [{"name": null, "field_n' + 685

vs

DayOfWeekUTC: string
  -- field metadata --
  PARQUET:field_id: '1'
MonthUTC: string
  -- field metadata --
  PARQUET:field_id: '2'
HourOfDayUTC: float
  -- field metadata --
  PARQUET:field_id: '3'
WeekNumberUTC: float
  -- field metadata --
  PARQUET:field_id: '4'
-- schema metadata --
pandas: '{"index_columns": [], "column_indexes": [{"name": null, "field_n' + 685

BTW, the two parquet files can be loaded individually using dask_cudf. But when they are loaded together (e.g. by pointing to a directory that contains the two files)

df = dask_cudf.read_parquet(PATH).compute()

the following error is raised by dask_cudf:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-8-a4ba9b6fbe05> in <module>
----> 1 df2 = df.compute()

/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/base.py in compute(self, **kwargs)
    165         dask.base.compute
    166         """
--> 167         (result,) = compute(self, traverse=False, **kwargs)
    168         return result
    169 

/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/base.py in compute(*args, **kwargs)
    450         postcomputes.append(x.__dask_postcompute__())
    451 
--> 452     results = schedule(dsk, keys, **kwargs)
    453     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
    454 

/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/local.py in get_sync(dsk, keys, **kwargs)
    525     """
    526     kwargs.pop("num_workers", None)  # if num_workers present, remove it
--> 527     return get_async(apply_sync, 1, dsk, keys, **kwargs)
    528 
    529 

/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/local.py in get_async(apply_async, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, **kwargs)
    492 
    493                 while state["ready"] and len(state["running"]) < num_workers:
--> 494                     fire_task()
    495 
    496             succeeded = True

/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/local.py in fire_task()
    464                         pack_exception,
    465                     ),
--> 466                     callback=queue.put,
    467                 )
    468 

/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/local.py in apply_sync(func, args, kwds, callback)
    514 def apply_sync(func, args=(), kwds={}, callback=None):
    515     """ A naive synchronous version of apply_async """
--> 516     res = func(*args, **kwds)
    517     if callback is not None:
    518         callback(res)

/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/local.py in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
    225         failed = False
    226     except BaseException as e:
--> 227         result = pack_exception(e, dumps)
    228         failed = True
    229     return key, result, failed

/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/local.py in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
    220     try:
    221         task, data = loads(task_info)
--> 222         result = _execute_task(task, data)
    223         id = get_id()
    224         result = dumps((result, id))

/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/core.py in _execute_task(arg, cache, dsk)
    119         # temporaries by their reference count and can execute certain
    120         # operations in-place.
--> 121         return func(*(_execute_task(a, cache) for a in args))
    122     elif not ishashable(arg):
    123         return arg

/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/dataframe/io/parquet/core.py in read_parquet_part(func, fs, meta, part, columns, index, kwargs)
    274     This function is used by `read_parquet`."""
    275     if isinstance(part, list):
--> 276         dfs = [func(fs, rg, columns.copy(), index, **kwargs) for rg in part]
    277         df = concat(dfs, axis=0)
    278     else:

/opt/conda/envs/rapids/lib/python3.7/site-packages/dask/dataframe/io/parquet/core.py in <listcomp>(.0)
    274     This function is used by `read_parquet`."""
    275     if isinstance(part, list):
--> 276         dfs = [func(fs, rg, columns.copy(), index, **kwargs) for rg in part]
    277         df = concat(dfs, axis=0)
    278     else:

/opt/conda/envs/rapids/lib/python3.7/site-packages/dask_cudf/io/parquet.py in read_partition(fs, piece, columns, index, categories, partitions, **kwargs)
     55                 row_groups=row_group,
     56                 strings_to_categorical=strings_to_cats,
---> 57                 **kwargs.get("read", {}),
     58             )
     59         else:

/opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/io/parquet.py in read_parquet(filepath_or_buffer, engine, columns, filters, row_groups, skiprows, num_rows, strings_to_categorical, use_pandas_metadata, *args, **kwargs)
    248             num_rows=num_rows,
    249             strings_to_categorical=strings_to_categorical,
--> 250             use_pandas_metadata=use_pandas_metadata,
    251         )
    252     else:

cudf/_lib/parquet.pyx in cudf._lib.parquet.read_parquet()

cudf/_lib/parquet.pyx in cudf._lib.parquet.read_parquet()

RuntimeError: cuDF failure at: /opt/conda/envs/rapids/conda-bld/libcudf_1603354682602/work/cpp/src/io/parquet/reader_impl.cu:229: Corrupted header or footer
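
The dask error message shown earlier suggests re-writing the data with to_parquet(..., schema="infer"). One possible (untested) workaround along those lines, assuming the files can at least be read one at a time on the CPU side and using hypothetical paths, is to load them individually, concatenate, and re-write the dataset with an inferred, consistent schema:

import glob
import dask.dataframe as dd

# Load each file on its own, concatenate the parts, and re-write the
# dataset so every output file carries the same (inferred) schema.
paths = sorted(glob.glob("/path/to/parquet_dir/*.parquet"))
parts = [dd.read_parquet(p, gather_statistics=False) for p in paths]
ddf = dd.concat(parts)
ddf.to_parquet("/path/to/fixed_parquet_dir/", schema="infer")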

Steps/Code to reproduce bug

Here is a folder with a minimalist notebook and two small parquet files to reproduce the issue (internal access only for the NVT team).
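
Since that folder is internal, here is a rough stand-alone sketch of the same situation: two parquet files with an identical column and dtype, one written with the field marked not null and the other nullable and containing an actual null. The path and the single column name are made up for illustration, and the sketch has not been run against NVT itself.

import os
import pyarrow as pa
import pyarrow.parquet as pq

os.makedirs("/tmp/repro", exist_ok=True)

# Same column and dtype, but different nullability in the parquet schema.
schema_not_null = pa.schema([pa.field("HourOfDayUTC", pa.float32(), nullable=False)])
schema_nullable = pa.schema([pa.field("HourOfDayUTC", pa.float32(), nullable=True)])

pq.write_table(pa.table({"HourOfDayUTC": [1.0, 2.0]}, schema=schema_not_null),
               "/tmp/repro/part1.parquet")
pq.write_table(pa.table({"HourOfDayUTC": [3.0, None]}, schema=schema_nullable),
               "/tmp/repro/part2.parquet")

# Pointing an NVT dataset (or dask_cudf.read_parquet) at /tmp/repro/
# should then reproduce the "Schemas are inconsistent" error:
# ds = nvt.Dataset("/tmp/repro", engine="parquet")
# ds.to_ddf().head()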

Expected behavior

NVT should be able to load a dataset whose parquet files share the same dtypes, even if the columns are marked not null in some files and nullable in others.

Environment details:

  • nvtabular: main branch (future 0.3)
  • cudf==0.16
  • dask_cudf==0.16
  • pyarrow==1.0.1

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 10 (7 by maintainers)

Top GitHub Comments

1 reaction
vinhngx commented, Nov 12, 2020

+1 I’ve seen this issue many times, but only with loading NVTabular’s parquet output. If we’ve got some tools to inspect and ensure data sanity (like the script from @rjzamora), it would be good to document and share in a “best practice” kinda document somewhere.

0 reactions
benfred commented, Oct 6, 2021

@gabrielspmoreira @rjzamora - closing this, re-open if there is still an issue
