
to_parquet() method fails even though column names are all strings

See original GitHub issue

Problem description

While attempting to serialize a pandas data frame with the to_parquet() method, I got an error message stating that the column names were not strings, even though they seem to be.

Code Example

I have a pandas data frame with the following columns:

In [1]: region_measurements.columns
Out [1]: Index([       u'measurement_id',                u'aoi_id',
                  u'created_ts',                  u'hash',
       u'algorithm_instance_id',                 u'state',
                  u'updated_ts',        u'aoi_version_id',
                u'ingestion_ts',            u'imaging_ts',
                    u'scene_id',             u'score_map',
            u'confidence_score',              u'fill_pct',
                    u'local_ts',        u'is_upper_bound',
             u'aoi_cloud_cover',      u'valid_pixel_frac'],
      dtype='object')

Seemingly, all of the column names are strings. The cells of the dataframe contain mixed data, with a JSON blob in some of them. (I’ve pasted the first row of the frame below.)

First row of the DF:

measurement_id: 5a1815e9-75f2-4954-bcd4-7e8835a65a22
aoi_id: 01ea1a2f-fb66-4243-aa87-c6cf206652f7
created_ts: 2018-12-11 02:20:57.507975+00:00
hash: 70b8e206fb7ee0030b7fe54bf5dfe54d
algorithm_instance_id: fillpct_cnn_attn_rsat2_spl_v1.1.0
state: COMPLETE
updated_ts: 2018-12-11 02:20:59.642500+00:00
aoi_version_id: 12512
ingestion_ts: 2018-11-03 21:37:10.210798+00:00
imaging_ts: 2018-11-03 00:36:18.575107+00:00
scene_id: RS2_OK103191_PK897635_DK833095_SLA27_20181103_...
score_map: None
confidence_score: 1.0
fill_pct: 17.161395
local_ts: 2018-11-02 19:36:18
is_upper_bound: False
aoi_cloud_cover: 0.000048
valid_pixel_frac: NaN


When I attempt to serialize this dataframe I get the following error:

In [2]: region_measurements.to_parquet('tmp.par')
Out [2]: ---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-69-082c78243d16> in <module>()
----> 1 region_measurements.to_parquet('tmp.par')

/usr/local/lib/python2.7/site-packages/pandas/core/frame.pyc in to_parquet(self, fname, engine, compression, **kwargs)
   1647         from pandas.io.parquet import to_parquet
   1648         to_parquet(self, fname, engine,
-> 1649                    compression=compression, **kwargs)
   1650 
   1651     @Substitution(header='Write out the column names. If a list of strings '

/usr/local/lib/python2.7/site-packages/pandas/io/parquet.pyc in to_parquet(df, path, engine, compression, **kwargs)
    225     """
    226     impl = get_engine(engine)
--> 227     return impl.write(df, path, compression=compression, **kwargs)
    228 
    229 

/usr/local/lib/python2.7/site-packages/pandas/io/parquet.pyc in write(self, df, path, compression, coerce_timestamps, **kwargs)
    105     def write(self, df, path, compression='snappy',
    106               coerce_timestamps='ms', **kwargs):
--> 107         self.validate_dataframe(df)
    108         if self._pyarrow_lt_070:
    109             self._validate_write_lt_070(df)

/usr/local/lib/python2.7/site-packages/pandas/io/parquet.pyc in validate_dataframe(df)
     53         # must have value column names (strings only)
     54         if df.columns.inferred_type not in {'string', 'unicode'}:
---> 55             raise ValueError("parquet must have string column names")
     56 
     57         # index level names must be strings

ValueError: parquet must have string column names

As far as I can tell, the column names are all strings. I also tried reset_index() and saving the resulting data frame, but got the same error.
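One plausible way to trip the check shown in the traceback (a sketch, not a confirmed diagnosis of this frame): pandas validates df.columns.inferred_type before writing, and a mix of byte strings and text strings infers as 'mixed' even though every label looks like a string in the repr. The column names below are taken from the issue but deliberately mixed in type for illustration.

```python
import pandas as pd

# Labels that all *look* like strings but mix bytes and str
# (illustrative; the issue's repr does not reveal per-label types).
cols = pd.Index(["measurement_id", b"aoi_id"])

# The parquet writer rejects anything whose inferred_type is not
# 'string' (or 'unicode' on Python 2); a bytes/str mix infers as 'mixed'.
print(cols.inferred_type)              # 'mixed'

# The workaround suggested in the comments: coerce every label to str.
print(cols.astype(str).inferred_type)  # 'string'
```

If this is the cause, the error message is accurate about the check but misleading about the symptom, since the repr prints both kinds of label identically.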

Expected Output

It’s possible that the frame cannot be serialized to parquet for some other reason, but the error message in this case seems misleading. Or there is a trick that I’m missing.

I’d be grateful for any help in resolving this!

Output of pd.show_versions()

```
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-693.2.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.22.0
pytest: 3.3.1
pip: 18.1
setuptools: 39.2.0
Cython: 0.25.2
numpy: 1.14.2
scipy: 0.19.0
pyarrow: 0.9.0
xarray: None
IPython: 5.7.0
sphinx: None
patsy: 0.4.1
dateutil: 2.7.5
pytz: 2017.3
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 2.0.0
openpyxl: None
xlrd: 0.9.4
xlwt: None
xlsxwriter: 0.7.6
lxml: None
bs4: 4.6.3
html5lib: 0.9999999
sqlalchemy: 1.0.5
pymysql: None
psycopg2: 2.7 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: 0.2.0
fastparquet: 0.1.5
pandas_gbq: None
pandas_datareader: None
```

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Reactions: 2
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

22 reactions
purezhanghan commented, Nov 29, 2019

You can try forcing the column labels to strings; it worked for me:

df.columns = df.columns.astype(str)

https://github.com/dask/fastparquet/issues/41

9 reactions
gosuto-inzasheru commented, Feb 3, 2021

What would be against embedding df.columns = df.columns.astype(str) into the .to_parquet method?


