Error importing Parquet from HDFS using PyArrow
Hi,
I get the following error when attempting to read a Parquet file stored on HDFS:

import dask.dataframe as dataframe

df = dataframe.read_parquet('hdfs://hdfsnn:8020/user/hdfs/ndsparq/Test.parquet', engine='arrow')
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-3-22d8573a1d6f> in <module>()
----> 1 df = dataframe.read_parquet('hdfs://hdfsnn:8020/user/hdfs/ndsparq/Test.parquet', engine='arrow')

/opt/anaconda3/envs/env0/lib/python3.6/site-packages/dask/dataframe/io/parquet.py in read_parquet(path, columns, filters, categories, index, storage_options, engine)
    293         return _read_pyarrow(fs, paths, file_opener, columns=columns,
    294                              filters=filters,
--> 295                              categories=categories, index=index)
    296
    297

/opt/anaconda3/envs/env0/lib/python3.6/site-packages/dask/dataframe/io/parquet.py in _read_pyarrow(fs, paths, file_opener, columns, filters, categories, index)
    164         columns = list(columns)
    165
--> 166     dataset = api.ParquetDataset(paths)
    167     schema = dataset.schema.to_arrow_schema()
    168     task_name = 'read-parquet-' + tokenize(dataset, columns)

/opt/anaconda3/envs/env0/lib/python3.6/site-packages/pyarrow/parquet.py in __init__(self, path_or_paths, filesystem, schema, metadata, split_row_groups, validate_schema)
    619
    620         (self.pieces, self.partitions,
--> 621          self.metadata_path) = _make_manifest(path_or_paths, self.fs)
    622
    623         if self.metadata_path is not None:

/opt/anaconda3/envs/env0/lib/python3.6/site-packages/pyarrow/parquet.py in _make_manifest(path_or_paths, fs, pathsep)
    779             if not fs.isfile(path):
    780                 raise IOError('Passed non-file path: {0}'
--> 781                               .format(path))
    782             piece = ParquetDatasetPiece(path)
    783             pieces.append(piece)

OSError: Passed non-file path: hdfs://hdfsnn:8020/user/hdfs/ndsparq/Test.parquet
The same Parquet file can be read using PyArrow directly without issue:

import pyarrow as pa
import pyarrow.parquet as pq

hdfs = pa.hdfs.connect('hdfsnn', 8020)  # connection assumed; host/port taken from the URL above
with hdfs.open('/user/hdfs/ndsparq/Test.parquet', 'rb') as f:
    table = pq.read_table(f)
df = table.to_pandas()
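As a stop-gap, here is a minimal workaround sketch (not from the thread; it assumes the same NameNode host/port as above, and load_parquet is a hypothetical helper): do the PyArrow read inside a dask.delayed task and assemble the frame with dd.from_delayed, so dask never has to parse the hdfs:// path itself.

import dask
import dask.dataframe as dd
import pyarrow as pa
import pyarrow.parquet as pq

@dask.delayed
def load_parquet(path):
    # Hypothetical helper: connect inside the task so it also works on remote workers.
    hdfs = pa.hdfs.connect('hdfsnn', 8020)  # assumed host/port, from the URL above
    with hdfs.open(path, 'rb') as f:
        return pq.read_table(f).to_pandas()

df = dd.from_delayed([load_parquet('/user/hdfs/ndsparq/Test.parquet')])

Each delayed task returns a plain pandas DataFrame, so the broken path handling in _read_pyarrow is bypassed entirely.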
Are you able to help? It looks like the filesystem is not passed to api.ParquetDataset in _read_pyarrow().
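For reference, the __init__ signature in the traceback above shows that ParquetDataset does accept a filesystem argument, so a call of roughly this shape works when the filesystem is supplied explicitly (a sketch, assuming the pyarrow.hdfs client and the host/port from the URL above):

import pyarrow as pa
import pyarrow.parquet as pq

# Pass an explicit HDFS filesystem plus a plain path, rather than a bare
# hdfs:// URL with no filesystem attached.
hdfs = pa.hdfs.connect('hdfsnn', 8020)
dataset = pq.ParquetDataset('/user/hdfs/ndsparq/Test.parquet', filesystem=hdfs)
df = dataset.read().to_pandas()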
Thanks very much!
Issue Analytics
- Created: 6 years ago
- Comments: 27 (20 by maintainers)
Top GitHub Comments
We could create a pure Python library to define an abstract public API for these sorts of things. Then we aren’t relying on duck typing.
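A minimal sketch of what such a shared interface might look like (all names here are illustrative, not a proposal from this thread), limited to the methods the traceback and examples above actually rely on:

from abc import ABC, abstractmethod

class AbstractFileSystem(ABC):
    """Hypothetical common filesystem API (illustrative name only)."""

    @abstractmethod
    def open(self, path, mode='rb'):
        """Return a file-like object for path."""

    @abstractmethod
    def isfile(self, path):
        """True if path refers to a regular file; _make_manifest needs this."""

With concrete HDFS, S3, and local implementations of an interface like this, dask could hand a filesystem object straight to pq.ParquetDataset instead of relying on whichever attributes happen to exist.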
Thanks, will move it to their issue tracker. 😃