
BUG: read_parquet no longer supports file-like objects

See original GitHub issue

Code Sample, a copy-pastable example

from io import BytesIO
import pandas as pd

buffer = BytesIO()

df = pd.DataFrame([1,2,3], columns=["a"])
df.to_parquet(buffer)

df2 = pd.read_parquet(buffer)

Problem description

Currently, read_parquet(buffer) raises the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "./working_dir/tvenv/lib/python3.7/site-packages/pandas/io/parquet.py", line 315, in read_parquet
    return impl.read(path, columns=columns, **kwargs)
  File "./working_dir/tvenv/lib/python3.7/site-packages/pandas/io/parquet.py", line 131, in read
    path, filesystem=get_fs_for_path(path), **kwargs
  File "./working_dir/tvenv/lib/python3.7/site-packages/pyarrow/parquet.py", line 1162, in __init__
    self.paths = _parse_uri(path_or_paths)
  File "./working_dir/tvenv/lib/python3.7/site-packages/pyarrow/parquet.py", line 47, in _parse_uri
    path = _stringify_path(path)
  File "./working_dir/tvenv/lib/python3.7/site-packages/pyarrow/util.py", line 67, in _stringify_path
    raise TypeError("not a path-like object")
TypeError: not a path-like object

Expected Output

Instead, read_parquet(buffer) should return a new DataFrame with the same contents as the DataFrame serialized into buffer.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : None
python : 3.7.5.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-99-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.4
numpy : 1.18.4
pytz : 2020.1
dateutil : 2.8.1
pip : 9.0.1
setuptools : 39.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 0.999999999
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 8
  • Comments: 26 (14 by maintainers)

Top GitHub Comments

3 reactions
austospumanto commented, May 30, 2020

@claytonlemons I am encountering the same issue.

If I downgrade from 1.0.4 --> 1.0.3 (while keeping the pyarrow version the same), I can again read from BytesIO buffers without issue. Since upgrading the pandas version from 1.0.3 --> 1.0.4 seems both necessary and sufficient to cause the file-like object reading issues, it seems like it may indeed be correct to consider this as an issue with pandas, not pyarrow.

@jreback Would you consider reopening this issue?

2 reactions
alimcmaster1 commented, Jun 8, 2020

@kepler I wonder if explicitly separating the kwargs into two parameters might be a solution.

@austospumanto

The fix for master pandas 1.1 is https://github.com/pandas-dev/pandas/pull/34500/files#diff-cbd427661c53f1dcde6ec5fb9ab0effaR134

We can potentially add tests that cover a few more of the kwargs, since we clearly don’t have coverage here at the moment.


Top Results From Across the Web

  • reading parquet to pandas FileNotFoundError - Stack Overflow
  • Cannot read parquet files with hive when created with a ...
  • Reading and Writing the Apache Parquet Format
  • Error writing parquet files - Databricks Community
  • Optimizing Access to Parquet Data with fsspec
