Cannot read pyarrow RangeIndex

import pandas as pd

df = pd.DataFrame([1, 2, 3], columns=['a'])
df.to_parquet('tmp.parquet', engine='pyarrow')
pd.read_parquet('tmp.parquet', engine='fastparquet')

This raises the following exception:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-42-d993694086f8> in <module>
      1 df = pd.DataFrame([1,2,3], columns=['a'])
      2 df.to_parquet('tmp.parquet', engine='pyarrow')
----> 3 pd.read_parquet('tmp.parquet', engine='fastparquet')

~/.pyenv/versions/3.6.2/envs/general/lib/python3.6/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, **kwargs)
    280 
    281     impl = get_engine(engine)
--> 282     return impl.read(path, columns=columns, **kwargs)

~/.pyenv/versions/3.6.2/envs/general/lib/python3.6/site-packages/pandas/io/parquet.py in read(self, path, columns, **kwargs)
    209             parquet_file = self.api.ParquetFile(path)
    210 
--> 211         return parquet_file.to_pandas(columns=columns, **kwargs)
    212 
    213 

~/.pyenv/versions/3.6.2/envs/general/lib/python3.6/site-packages/fastparquet/api.py in to_pandas(self, columns, categories, filters, index)
    419         if index:
    420             columns += [i for i in index if i not in columns]
--> 421         check_column_names(self.columns + list(self.cats), columns, categories)
    422         df, views = self.pre_allocate(size, columns, categories, index)
    423         start = 0

~/.pyenv/versions/3.6.2/envs/general/lib/python3.6/site-packages/fastparquet/util.py in check_column_names(columns, *args)
     90     for arg in args:
     91         if isinstance(arg, (tuple, list)):
---> 92             if set(arg) - set(columns):
     93                 raise ValueError("Column name not in list.\n"
     94                                  "Requested %s\n"

TypeError: unhashable type: 'dict'

This is most likely a result of https://github.com/pandas-dev/pandas/issues/25672 and https://github.com/apache/arrow/pull/3868.
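The last frame of the traceback can be reproduced without either library. The sketch below assumes that the list passed to check_column_names contained a dict-valued RangeIndex descriptor next to the ordinary column names. The descriptor shape follows the pandas Parquet metadata convention, and the variable names are hypothetical.

```python
# Hypothetical stand-in for the index entry that newer pyarrow records in the
# pandas metadata instead of writing a real index column (assumed shape).
range_descriptor = {"kind": "range", "name": None, "start": 0, "stop": 3, "step": 1}

file_columns = ["a"]                 # columns actually stored in the file
requested = ["a", range_descriptor]  # data columns plus the index entries

error_message = None
try:
    # Same operation as line 92 of fastparquet/util.py in the traceback:
    # building a set requires every element to be hashable, and a dict is not.
    missing = set(requested) - set(file_columns)
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

The printed message mentions the unhashable 'dict' type, matching the TypeError in the report.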

Top GitHub Comments

martindurant commented, Jun 5, 2019

Exactly what you get now depends on which versions of pyarrow and fastparquet you used. In older versions (<0.13), pyarrow wrote real columns of data for the index, with names like the cryptic one you show. When you load with fastparquet and say “I don’t want to set an index”, that column becomes an ordinary column. If you do allow it to be set as an index, the name should be reconstituted to None. You could also just use columns= to ignore it completely.

In the most recent version of pyarrow, there would be no column data, but a range index metadata marker instead. It takes up no space, and there is no reason not to have it populate the index. In this case, if you said you wanted to ignore the index, or use another, the range should be ignored.
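The two cases described above can be told apart by looking at the index_columns entry of the pandas metadata stored in the file. The following is a minimal sketch of that distinction, not fastparquet's actual implementation. The helper split_index_entries is hypothetical, the metadata values are illustrative, and the legacy name __index_level_0__ is assumed to be the kind of "cryptic" column name mentioned above.

```python
import json

# Illustrative pandas metadata as newer pyarrow might record it: the index is
# described by a dict marker, and no index column is stored in the file.
pandas_metadata = json.loads("""
{"index_columns": [{"kind": "range", "name": null,
                    "start": 0, "stop": 3, "step": 1}]}
""")


def split_index_entries(index_columns):
    """Separate stored index columns (strings) from range markers (dicts)."""
    stored, ranges = [], []
    for entry in index_columns:
        if isinstance(entry, dict) and entry.get("kind") == "range":
            # No column data exists; rebuild the index from the marker.
            ranges.append(range(entry["start"], entry["stop"], entry["step"]))
        else:
            # Older files: the index lives in a real column with this name.
            stored.append(entry)
    return stored, ranges


stored_index, range_indexes = split_index_entries(pandas_metadata["index_columns"])
legacy_stored, legacy_ranges = split_index_entries(["__index_level_0__"])

print(stored_index, [list(r) for r in range_indexes])
print(legacy_stored, legacy_ranges)
```

Treating the dict entries separately keeps them out of any set or membership check against column names, which is where the original error arises.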

martindurant commented, Jun 30, 2019

Are you saying that current fastparquet can’t read older pyarrow-written data? That would indeed be a problem.
