BUG: New param [use_nullable_dtypes] of pd.read_parquet() can't handle empty parquet file
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
import pandas as pd
df_pq = pd.read_parquet(x, use_nullable_dtypes=True)  # x: path to the empty Parquet file
Problem description
I get an error when I add the new parameter use_nullable_dtypes to pd.read_parquet(). If I remove it, everything goes back to normal. OS: Ubuntu 16, Python: 3.8
An empty Parquet file written by Spark causes the problem. Its schema is:
Authors,AuthorId,int64
Authors,Rank,int32
Authors,NormalizedName,string
Authors,DisplayName,string
Authors,LastKnownAffiliationId,int64
Authors,PaperCount,int64
Authors,PaperFamilyCount,int64
Authors,CitationCount,int64
Authors,CreatedDate,date32[day]
Error message:
    df_pq = pd.read_parquet(x,use_nullable_dtypes = True)
  File "/vjan/lib/python3.8/site-packages/pandas/io/parquet.py", line 459, in read_parquet
    return impl.read(
  File "/vjan/lib/python3.8/site-packages/pandas/io/parquet.py", line 221, in read
    return self.api.parquet.read_table(
  File "pyarrow/array.pxi", line 751, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 1668, in pyarrow.lib.Table._to_pandas
  File "/vjan/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 792, in table_to_blockmanager
    blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
  File "/vjan/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 1133, in _table_to_blocks
    return [_reconstruct_block(item, columns, extension_columns)
  File "/vjan/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 1133, in <listcomp>
    return [_reconstruct_block(item, columns, extension_columns)
  File "/vjan/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 751, in _reconstruct_block
    pd_ext_arr = pandas_dtype.__from_arrow__(arr)
  File "/vjan/lib/python3.8/site-packages/pandas/core/arrays/integer.py", line 121, in __from_arrow__
    return IntegerArray._concat_same_type(results)
  File "/vjan/lib/python3.8/site-packages/pandas/core/arrays/masked.py", line 271, in _concat_same_type
    data = np.concatenate([x._data for x in to_concat])
  File "<__array_function__ internals>", line 5, in concatenate
ValueError: need at least one array to concatenate
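The last frames of the traceback point at np.concatenate being handed an empty list. A plausible reading, stated here as an assumption rather than a confirmed diagnosis, is that the empty file yields columns with zero chunks, so there is nothing to concatenate. The sketch below reproduces that final step with NumPy alone; `concat_chunks` and `concat_chunks_safe` are hypothetical helpers, not pandas functions.

```python
import numpy as np


def concat_chunks(chunks):
    """Mimic the failing step: concatenate per-chunk buffers unconditionally."""
    return np.concatenate(list(chunks))


def concat_chunks_safe(chunks, dtype=np.int64):
    """Same step with a guard: return an empty array when there are no chunks."""
    chunks = list(chunks)
    if not chunks:
        return np.array([], dtype=dtype)
    return np.concatenate(chunks)


# Zero chunks triggers the same ValueError that ends the traceback.
try:
    concat_chunks([])
except ValueError as exc:
    error_message = str(exc)

# The guarded variant returns an empty array instead of raising.
empty = concat_chunks_safe([])
```

A guard of this kind is one way the zero-chunk case could be handled; it is not necessarily how pandas resolved it.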
Expected Output
Read the empty Parquet file and return an empty DataFrame.
Output of pd.show_versions()
1.2.4
Issue Analytics
- Created 2 years ago
- Comments: 7 (6 by maintainers)
take
Thanks @nakatomotoi. pandas has a test suite that is run on CI when a PR is opened. This issue requires a test to be added to the test suite so that we can close the issue knowing that similar regressions are less likely in the future.
See https://github.com/pandas-dev/pandas/issues?q=is%3Aissue+is%3Aclosed+label%3A"Needs+Tests" for issues like this that have been closed, and check out the associated PRs for inspiration.
The developer guide is https://pandas.pydata.org/pandas-docs/dev/development/index.html