
Error importing Parquet from HDFS using PyArrow

See original GitHub issue

Hi,

I get the following error when attempting to read a parquet file stored on hdfs:

df = dataframe.read_parquet('hdfs://hdfsnn:8020/user/hdfs/ndsparq/Test.parquet', engine='arrow')
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-3-22d8573a1d6f> in <module>()
----> 1 df = dataframe.read_parquet('hdfs://hdfsnn:8020/user/hdfs/ndsparq/Test.parquet', engine='arrow')

/opt/anaconda3/envs/env0/lib/python3.6/site-packages/dask/dataframe/io/parquet.py in read_parquet(path, columns, filters, categories, index, storage_options, engine)
    293         return _read_pyarrow(fs, paths, file_opener, columns=columns,
    294                              filters=filters,
--> 295                              categories=categories, index=index)
    296 
    297 

/opt/anaconda3/envs/env0/lib/python3.6/site-packages/dask/dataframe/io/parquet.py in _read_pyarrow(fs, paths, file_opener, columns, filters, categories, index)
    164         columns = list(columns)
    165 
--> 166     dataset = api.ParquetDataset(paths)
    167     schema = dataset.schema.to_arrow_schema()
    168     task_name = 'read-parquet-' + tokenize(dataset, columns)

/opt/anaconda3/envs/env0/lib/python3.6/site-packages/pyarrow/parquet.py in __init__(self, path_or_paths, filesystem, schema, metadata, split_row_groups, validate_schema)
    619 
    620         (self.pieces, self.partitions,
--> 621          self.metadata_path) = _make_manifest(path_or_paths, self.fs)
    622 
    623         if self.metadata_path is not None:

/opt/anaconda3/envs/env0/lib/python3.6/site-packages/pyarrow/parquet.py in _make_manifest(path_or_paths, fs, pathsep)
    779             if not fs.isfile(path):
    780                 raise IOError('Passed non-file path: {0}'
--> 781                               .format(path))
    782             piece = ParquetDatasetPiece(path)
    783             pieces.append(piece)

OSError: Passed non-file path: hdfs://hdfsnn:8020/user/hdfs/ndsparq/Test.parquet

The same parquet file can be read using PyArrow without issue:

# assuming imports and an HDFS connection made earlier, e.g.:
# import pyarrow as pa
# import pyarrow.parquet as pq
# hdfs = pa.hdfs.connect('hdfsnn', 8020)
with hdfs.open('/user/hdfs/ndsparq/Test.parquet', 'rb') as f:
    table = pq.read_table(f)
df = table.to_pandas()

Are you able to help? It looks like the filesystem is not passed to api.ParquetDataset in _read_pyarrow().

Thanks very much!

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 27 (20 by maintainers)

Top GitHub Comments

1 reaction
wesm commented, Nov 10, 2017

We could create a pure Python library to define an abstract public API for these sorts of things. Then we aren’t relying on duck typing.
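A minimal sketch of the kind of abstract filesystem API wesm describes; the class and method names here are hypothetical, not an actual library's interface:

```python
import os
from abc import ABC, abstractmethod

class AbstractFileSystem(ABC):
    """Abstract base class a filesystem must implement to be accepted."""

    @abstractmethod
    def open(self, path, mode="rb"):
        """Return a file-like object for the given path."""

    @abstractmethod
    def isfile(self, path):
        """Return True if the path refers to a regular file."""

class LocalFileSystem(AbstractFileSystem):
    """Concrete implementation backed by the local filesystem."""

    def open(self, path, mode="rb"):
        return open(path, mode)

    def isfile(self, path):
        return os.path.isfile(path)

# Consumers can now check isinstance() instead of relying on duck typing.
fs = LocalFileSystem()
print(isinstance(fs, AbstractFileSystem))
```

The fsspec project later provided essentially this abstraction, and dask's filesystem handling moved onto it.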

0 reactions
RoelantStegmann commented, Nov 19, 2019

Thanks, will move it to their issue tracker. 😃


