BUG: `DataFrame.to_parquet` doesn't round-trip pyarrow StringDtype
See original GitHub issue.

- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example

```python
import pandas as pd

a = pd.DataFrame({"A": pd.array(["a", "b"], dtype=pd.StringDtype("pyarrow"))})
a.to_parquet("test.parquet")
b = pd.read_parquet("test.parquet")
pd.testing.assert_frame_equal(a, b)  # fails: the dtype changes on the round trip
```
```
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
/tmp/ipykernel_493/3001616580.py in <module>
      3 a.to_parquet("test.parquet")
      4 b = pd.read_parquet("test.parquet")
----> 5 pd.testing.assert_frame_equal(a, b)

[... skipping hidden 3 frame]

/srv/conda/envs/notebook/lib/python3.8/site-packages/pandas/_testing/asserters.py in raise_assert_detail(obj, message, left, right, diff, index_values)
    663         msg += f"\n[diff]: {diff}"
    664
--> 665     raise AssertionError(msg)
    666
    667

AssertionError: Attributes of DataFrame.iloc[:, 0] (column name="A") are different

Attribute "dtype" are different
[left]:  string[pyarrow]
[right]: string[python]
```
Problem description

read_parquet currently loads all string columns as string[python]. We'd ideally match the dtype that was written.
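One way to see what read_parquet has to work with is to inspect the pandas metadata that to_parquet embeds in the file's Arrow schema. A minimal sketch, assuming pyarrow is installed and using the test.parquet file from the example above:

```python
import pyarrow.parquet as pq

# Read only the schema; the "pandas" metadata block records how the
# original DataFrame's columns were typed when the file was written.
schema = pq.read_schema("test.parquet")
print(schema.pandas_metadata)
```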
Expected Output

A DataFrame with dtype string[pyarrow] rather than string[python].
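Until read_parquet restores the pyarrow-backed dtype itself, one possible workaround is to read the file with pyarrow directly and map Arrow string columns back through types_mapper. This is a sketch of that approach, not the eventual pandas fix; it assumes the test.parquet file written above:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

table = pq.read_table("test.parquet")
# types_mapper lets to_pandas() choose a pandas extension dtype per Arrow
# type; here every Arrow string column becomes the pyarrow-backed StringDtype.
b = table.to_pandas(types_mapper={pa.string(): pd.StringDtype("pyarrow")}.get)
print(b.dtypes)  # expect: A    string[pyarrow]
```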
Output of pd.show_versions()

```
INSTALLED VERSIONS
------------------
commit : f00ed8f47020034e752baf0250483053340971b0
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-1040-azure
Version : #42~18.04.1-Ubuntu SMP Mon Feb 8 19:05:32 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : C.UTF-8
LANG : C.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.3.0
numpy : 1.21.0
pytz : 2021.1
dateutil : 2.7.5
pip : 20.3.4
setuptools : 49.6.0.post20210108
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : 1.10.2
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.1 (dt dec pq3 ext lo64)
jinja2 : 3.0.1
IPython : 7.25.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 2021.06.1
fastparquet : None
gcsfs : 2021.06.1
matplotlib : 3.4.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 4.0.0
pyxlsb : None
s3fs : 2021.06.1
scipy : 1.7.0
sqlalchemy : 1.4.20
tables : None
tabulate : 0.8.9
xarray : 0.18.2
xlrd : None
xlwt : None
numba : 0.53.1
```
Issue Analytics

- State:
- Created: 2 years ago
- Comments: 8 (5 by maintainers)
Comments

Yep, that would be great @jeremyswerdlow!
This would be really nice, as the memory difference can be huge. I got tripped up when trying to load a table stored by pyarrow that took 16 GB using string[pyarrow] but more than 60 GB using regular string[python].
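For a rough sense of that gap on synthetic data, one can compare the reported footprint of the two backends; a small sketch, where the exact numbers depend on the strings and on the pandas/pyarrow versions:

```python
import pandas as pd

data = [f"row {i}" for i in range(1_000_000)]

# Default object-dtype strings: one heap-allocated Python str per element.
obj = pd.Series(data)
# Arrow-backed strings: contiguous character buffers plus offsets.
arrow = pd.Series(data, dtype="string[pyarrow]")

print(obj.memory_usage(deep=True))    # counts the per-object overhead
print(arrow.memory_usage(deep=True))  # roughly the raw character data
```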