
Error importing Parquet from HDFS using PyArrow

See original GitHub issue

Hi,

I get the following error when attempting to read a parquet file stored on hdfs:

df = dataframe.read_parquet('hdfs://hdfsnn:8020/user/hdfs/ndsparq/Test.parquet', engine='arrow')
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-3-22d8573a1d6f> in <module>()
----> 1 df = dataframe.read_parquet('hdfs://hdfsnn:8020/user/hdfs/ndsparq/Test.parquet', engine='arrow')

/opt/anaconda3/envs/env0/lib/python3.6/site-packages/dask/dataframe/io/parquet.py in read_parquet(path, columns, filters, categories, index, storage_options, engine)
    293         return _read_pyarrow(fs, paths, file_opener, columns=columns,
    294                              filters=filters,
--> 295                              categories=categories, index=index)
    296 
    297 

/opt/anaconda3/envs/env0/lib/python3.6/site-packages/dask/dataframe/io/parquet.py in _read_pyarrow(fs, paths, file_opener, columns, filters, categories, index)
    164         columns = list(columns)
    165 
--> 166     dataset = api.ParquetDataset(paths)
    167     schema = dataset.schema.to_arrow_schema()
    168     task_name = 'read-parquet-' + tokenize(dataset, columns)

/opt/anaconda3/envs/env0/lib/python3.6/site-packages/pyarrow/parquet.py in __init__(self, path_or_paths, filesystem, schema, metadata, split_row_groups, validate_schema)
    619 
    620         (self.pieces, self.partitions,
--> 621          self.metadata_path) = _make_manifest(path_or_paths, self.fs)
    622 
    623         if self.metadata_path is not None:

/opt/anaconda3/envs/env0/lib/python3.6/site-packages/pyarrow/parquet.py in _make_manifest(path_or_paths, fs, pathsep)
    779             if not fs.isfile(path):
    780                 raise IOError('Passed non-file path: {0}'
--> 781                               .format(path))
    782             piece = ParquetDatasetPiece(path)
    783             pieces.append(piece)

OSError: Passed non-file path: hdfs://hdfsnn:8020/user/hdfs/ndsparq/Test.parquet

The same parquet file can be read using PyArrow without issue:

# assuming imports and an HDFS connection made earlier, e.g.:
# import pyarrow as pa
# import pyarrow.parquet as pq
# hdfs = pa.hdfs.connect('hdfsnn', 8020)
with hdfs.open('/user/hdfs/ndsparq/Test.parquet', 'rb') as f:
    table = pq.read_table(f)
df = table.to_pandas()

Are you able to help? It looks like the filesystem is not passed to api.ParquetDataset in _read_pyarrow().

Thanks very much!

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 27 (20 by maintainers)

Top GitHub Comments

1 reaction
wesm commented, Nov 10, 2017

We could create a pure Python library to define an abstract public API for these sorts of things. Then we aren’t relying on duck typing.
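A minimal sketch of the kind of abstract filesystem API wesm describes; the class and method names here are hypothetical, not an actual library's interface:

```python
import os
from abc import ABC, abstractmethod

class AbstractFileSystem(ABC):
    """Abstract base class a filesystem must implement to be accepted."""

    @abstractmethod
    def open(self, path, mode="rb"):
        """Return a file-like object for the given path."""

    @abstractmethod
    def isfile(self, path):
        """Return True if the path refers to a regular file."""

class LocalFileSystem(AbstractFileSystem):
    """Concrete implementation backed by the local filesystem."""

    def open(self, path, mode="rb"):
        return open(path, mode)

    def isfile(self, path):
        return os.path.isfile(path)

# Consumers can now check isinstance() instead of relying on duck typing.
fs = LocalFileSystem()
print(isinstance(fs, AbstractFileSystem))
```

The fsspec project later provided essentially this abstraction, and dask's filesystem handling moved onto it.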

0 reactions
RoelantStegmann commented, Nov 19, 2019

Thanks, will move it to their issue tracker. 😃


