Cannot read pyarrow RangeIndex

import pandas as pd

df = pd.DataFrame([1,2,3], columns=['a'])
df.to_parquet('tmp.parquet', engine='pyarrow')        # write with pyarrow
pd.read_parquet('tmp.parquet', engine='fastparquet')  # read back with fastparquet

This raises the following exception:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-42-d993694086f8> in <module>
      1 df = pd.DataFrame([1,2,3], columns=['a'])
      2 df.to_parquet('tmp.parquet', engine='pyarrow')
----> 3 pd.read_parquet('tmp.parquet', engine='fastparquet')

~/.pyenv/versions/3.6.2/envs/general/lib/python3.6/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, **kwargs)
    280 
    281     impl = get_engine(engine)
--> 282     return impl.read(path, columns=columns, **kwargs)

~/.pyenv/versions/3.6.2/envs/general/lib/python3.6/site-packages/pandas/io/parquet.py in read(self, path, columns, **kwargs)
    209             parquet_file = self.api.ParquetFile(path)
    210 
--> 211         return parquet_file.to_pandas(columns=columns, **kwargs)
    212 
    213 

~/.pyenv/versions/3.6.2/envs/general/lib/python3.6/site-packages/fastparquet/api.py in to_pandas(self, columns, categories, filters, index)
    419         if index:
    420             columns += [i for i in index if i not in columns]
--> 421         check_column_names(self.columns + list(self.cats), columns, categories)
    422         df, views = self.pre_allocate(size, columns, categories, index)
    423         start = 0

~/.pyenv/versions/3.6.2/envs/general/lib/python3.6/site-packages/fastparquet/util.py in check_column_names(columns, *args)
     90     for arg in args:
     91         if isinstance(arg, (tuple, list)):
---> 92             if set(arg) - set(columns):
     93                 raise ValueError("Column name not in list.\n"
     94                                  "Requested %s\n"

TypeError: unhashable type: 'dict'

This is most likely the result of https://github.com/pandas-dev/pandas/issues/25672 and https://github.com/apache/arrow/pull/3868.
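
To illustrate where the TypeError in the traceback can come from, here is a minimal standard-library sketch. It assumes, without this being shown in the issue itself, that newer pyarrow records a default RangeIndex as a dict descriptor in the `index_columns` field of the file's pandas metadata, and that the reader appends those entries to the requested column list as if they were column names. The metadata excerpt and the variable names are hypothetical; only the final set difference mirrors the check shown in the traceback.

```python
import json

# Hypothetical excerpt of the pandas metadata stored in the Parquet file.
# The dict form of the index entry is an assumption about newer pyarrow.
pandas_metadata = json.loads("""
{
  "index_columns": [
    {"kind": "range", "name": null, "start": 0, "stop": 3, "step": 1}
  ],
  "columns": [{"name": "a", "pandas_type": "int64"}]
}
""")

file_columns = ["a"]   # columns physically present in the file
requested = ["a"]      # columns the caller asked for
index = pandas_metadata["index_columns"]

# Same pattern as in the traceback: index entries are appended as if
# they were plain column names.
requested += [i for i in index if i not in requested]

try:
    # Building a set from a list that contains a dict is what fails.
    set(requested) - set(file_columns)
    error = None
except TypeError as exc:
    error = str(exc)

print(error)
```

If the assumption holds, the failure is not about missing data at all: the index descriptor is simply not a hashable column name.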

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Reactions: 3
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

martindurant commented, Jun 5, 2019 (1 reaction)

Exactly what you get now depends on which versions of pyarrow and fastparquet you used. In the past (pyarrow < 0.13), pyarrow would write real columns of data for the index, with names like the cryptic one you show. When you load with fastparquet and say “I don’t want to set an index”, it becomes an ordinary column. If you do allow it to be set as an index, the name should be reconstituted to None. You could also just use columns= to ignore it completely.

In the most recent version of pyarrow, there would be no column data, but a range index metadata marker instead. It takes up no space, and there is no reason not to have it populate the index. In this case, if you said you wanted to ignore the index, or use another, the range should be ignored.
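
The distinction described above, between an index stored as a real column and one stored only as a range marker, can be sketched as follows. This is not fastparquet's or pyarrow's actual code: `split_index_columns` is a hypothetical helper, and both the `__index_level_0__` name and the dict layout are assumptions about what the writer puts in the metadata.

```python
def split_index_columns(index_columns):
    """Separate column-backed index names from range-index descriptors.

    Returns (names, ranges): names are strings that refer to real data
    columns, ranges are Python range objects rebuilt from metadata only.
    """
    names, ranges = [], []
    for entry in index_columns:
        if isinstance(entry, dict) and entry.get("kind") == "range":
            # No column data exists; the index is fully described by
            # start/stop/step and costs no storage.
            ranges.append(range(entry["start"], entry["stop"], entry["step"]))
        else:
            # Older style: the index was written as an ordinary column.
            names.append(entry)
    return names, ranges


# Assumed older-style metadata: index stored as a named data column.
old_style = ["__index_level_0__"]
# Assumed newer-style metadata: index stored as a range marker.
new_style = [{"kind": "range", "name": None, "start": 0, "stop": 3, "step": 1}]

print(split_index_columns(old_style))
print(split_index_columns(new_style))
```

A reader that handles both cases can pass only the names on to column validation and either rebuild the index from the range or ignore it when the caller asks for no index.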

martindurant commented, Jun 30, 2019 (0 reactions)

Are you saying that current fastparquet can’t read older pyarrow-written data? That would indeed be a problem.

