
BUG: New param [use_nullable_dtypes] of pd.read_parquet() can't handle empty parquet file

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example


import pandas as pd

df_pq = pd.read_parquet(x, use_nullable_dtypes=True)  # x: path to the empty parquet file

Problem description

I get an error when adding the new parameter use_nullable_dtypes to pd.read_parquet(). If I remove it, everything goes back to normal. OS: Ubuntu 16, Python: 3.8.

An empty parquet file written by Spark triggers the problem. Its schema is:

Authors,AuthorId,int64
Authors,Rank,int32
Authors,NormalizedName,string
Authors,DisplayName,string
Authors,LastKnownAffiliationId,int64
Authors,PaperCount,int64
Authors,PaperFamilyCount,int64
Authors,CitationCount,int64
Authors,CreatedDate,date32[day]
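
For reference, the failure can be reproduced without Spark by writing a zero-row parquet file with pyarrow. A minimal sketch, assuming pyarrow is installed; the single int64 column stands in for the full schema above:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Write a parquet file with one int64 column and no rows, mimicking the
# empty file produced by Spark.
schema = pa.schema([("AuthorId", pa.int64())])
pq.write_table(pa.table({"AuthorId": []}, schema=schema), "empty.parquet")

# On pandas 1.2.x this should hit the same code path and raise:
#   ValueError: need at least one array to concatenate
df_pq = pd.read_parquet("empty.parquet", use_nullable_dtypes=True)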

Error message:

df_pq = pd.read_parquet(x, use_nullable_dtypes=True)

  File "/vjan/lib/python3.8/site-packages/pandas/io/parquet.py", line 459, in read_parquet
    return impl.read(
  File "/vjan/lib/python3.8/site-packages/pandas/io/parquet.py", line 221, in read
    return self.api.parquet.read_table(
  File "pyarrow/array.pxi", line 751, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 1668, in pyarrow.lib.Table._to_pandas
  File "/vjan/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 792, in table_to_blockmanager
    blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
  File "/vjan/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 1133, in _table_to_blocks
    return [_reconstruct_block(item, columns, extension_columns)
  File "/vjan/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 1133, in <listcomp>
    return [_reconstruct_block(item, columns, extension_columns)
  File "/vjan/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 751, in _reconstruct_block
    pd_ext_arr = pandas_dtype.__from_arrow__(arr)
  File "/vjan/lib/python3.8/site-packages/pandas/core/arrays/integer.py", line 121, in __from_arrow__
    return IntegerArray._concat_same_type(results)
  File "/vjan/lib/python3.8/site-packages/pandas/core/arrays/masked.py", line 271, in _concat_same_type
    data = np.concatenate([x._data for x in to_concat])
  File "<__array_function__ internals>", line 5, in concatenate
ValueError: need at least one array to concatenate
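
The last frames point at the likely root cause: pyarrow hands the empty column back as a ChunkedArray with zero chunks, so the list that IntegerArray._concat_same_type receives is empty, and np.concatenate rejects an empty sequence. A two-line illustration of the final failing call:

import numpy as np

# This is what the trace bottoms out in when there are zero chunks.
np.concatenate([])  # ValueError: need at least one array to concatenate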

Expected Output

Read the empty parquet file and return an empty DataFrame.
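
Until a fix lands, one possible workaround is to read without the flag and opt in to nullable dtypes afterwards. A sketch; note that convert_dtypes() infers dtypes from the data, so on an empty frame the result may not match exactly what use_nullable_dtypes=True would produce:

import pandas as pd

# Read normally, then convert columns to the nullable extension dtypes.
df_pq = pd.read_parquet(x).convert_dtypes()  # x: path to the parquet file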

Output of pd.show_versions()

1.2.4

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 7 (6 by maintainers)

Top GitHub Comments

1 reaction
nakatomotoi commented, Sep 1, 2021

take

1 reaction
simonjayhawkins commented, Aug 25, 2021

Thanks @nakatomotoi. pandas has a test suite that is run on CI when a PR is opened. This issue requires a test to be added to the test suite so that we can close the issue knowing that future similar regressions should be less likely.

See https://github.com/pandas-dev/pandas/issues?q=is%3Aissue+is%3Aclosed+label%3A"Needs+Tests" for issues like this that have been closed, and check out the associated PRs for inspiration.

The developer guide is https://pandas.pydata.org/pandas-docs/dev/development/index.html
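
For anyone picking this up, a regression test could look roughly like the sketch below, in the style of pandas/tests/io/test_parquet.py. The test name and the single Int64 column are illustrative, not necessarily what the eventual PR used:

import pandas as pd
import pandas._testing as tm

def test_read_parquet_use_nullable_dtypes_empty_file(tmp_path):
    # Round-trip an empty frame and request nullable dtypes on read.
    path = tmp_path / "empty.parquet"
    expected = pd.DataFrame({"a": pd.array([], dtype="Int64")})
    expected.to_parquet(path, index=False)

    result = pd.read_parquet(path, use_nullable_dtypes=True)
    tm.assert_frame_equal(result, expected)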


Top Results From Across the Web

  • Error reading empty parquet file as pandas DataFrame: I found a solution to handle that but is there a more elegant way than that? df = df.loc[[]] # instead of df.loc[[],...
  • pandas.read_parquet — pandas 1.5.2 documentation: Load a parquet object from the file path, returning a DataFrame. Parameters ... PathLike[str] ), or file-like object implementing a binary read() function....
  • Parquet Files - Spark 3.3.1 Documentation: Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data.
  • Solved: Spark 2 Can't write dataframe to parquet table: Solved: I'm trying to write a dataframe to a parquet hive table and keep getting an error saying that the table - 61712....
  • Using the Parquet format in AWS Glue: You can use AWS Glue to read Parquet files from Amazon S3 and from streaming sources as well as write Parquet files to...
