Parquet filepath parsing error
I attempted to load a large Parquet dataset on S3 with
import dask.dataframe as dd
df = dd.read_parquet("s3://oss-shared-scratch/timeseries.parquet", storage_options=storage_options)
but encountered the following error from pyarrow (full traceback below):
OSError: Passed non-file path: oss-shared-scratch/timeseries.parquet
I’m not sure whether the underlying issue is in pyarrow, dask, or fsspec. I’ll attempt to create a minimal reproducer, but thought the traceback might be useful by itself.
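Until that reproducer exists, one possible sketch (not verified) is to bypass dask and call pyarrow’s legacy ParquetDataset directly with an s3fs filesystem, since that is the code path the traceback below goes through; the empty storage_options dict is a placeholder for the real credentials used above.

import pyarrow.parquet as pq
import s3fs

storage_options = {}  # placeholder; use the same credentials as the dd.read_parquet call above
fs = s3fs.S3FileSystem(**storage_options)

# use_legacy_dataset=True selects the ParquetDataset implementation that goes
# through _make_manifest and fs.isfile(), which is where the OSError is raised.
dataset = pq.ParquetDataset(
    "oss-shared-scratch/timeseries.parquet",
    filesystem=fs,
    use_legacy_dataset=True,
)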
cc @jorisvandenbossche @rjzamora
Full traceback:
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<timed exec> in <module>
~/projects/dask/dask/dask/dataframe/io/parquet/core.py in read_parquet(path, columns, filters, categories, index, storage_options, engine, gather_statistics, split_row_groups, read_from_paths, chunksize, aggregate_files, **kwargs)
314 gather_statistics = True
315
--> 316 read_metadata_result = engine.read_metadata(
317 fs,
318 paths,
~/projects/dask/dask/dask/dataframe/io/parquet/arrow.py in read_metadata(cls, fs, paths, categories, index, gather_statistics, filters, split_row_groups, read_from_paths, chunksize, aggregate_files, **kwargs)
540 split_row_groups,
541 gather_statistics,
--> 542 ) = cls._gather_metadata(
543 paths,
544 fs,
~/projects/dask/dask/dask/dataframe/io/parquet/arrow.py in _gather_metadata(cls, paths, fs, split_row_groups, gather_statistics, filters, index, dataset_kwargs)
1786
1787 # Step 1: Create a ParquetDataset object
-> 1788 dataset, base, fns = _get_dataset_object(paths, fs, filters, dataset_kwargs)
1789 if fns == [None]:
1790 # This is a single file. No danger in gathering statistics
~/projects/dask/dask/dask/dataframe/io/parquet/arrow.py in _get_dataset_object(paths, fs, filters, dataset_kwargs)
1754 base = paths[0]
1755 fns = [None]
-> 1756 dataset = pq.ParquetDataset(paths[0], filesystem=fs, **kwargs)
1757
1758 return dataset, base, fns
~/mambaforge/envs/coiled/lib/python3.9/site-packages/pyarrow/parquet.py in __init__(self, path_or_paths, filesystem, schema, metadata, split_row_groups, validate_schema, filters, metadata_nthreads, read_dictionary, memory_map, buffer_size, partitioning, use_legacy_dataset, pre_buffer, coerce_int96_timestamp_unit)
1330 self._partitions,
1331 self.common_metadata_path,
-> 1332 self.metadata_path) = _make_manifest(
1333 path_or_paths, self._fs, metadata_nthreads=metadata_nthreads,
1334 open_file_func=partial(_open_dataset_file, self._metadata)
~/mambaforge/envs/coiled/lib/python3.9/site-packages/pyarrow/parquet.py in _make_manifest(path_or_paths, fs, pathsep, metadata_nthreads, open_file_func)
1577 for path in path_or_paths:
1578 if not fs.isfile(path):
-> 1579 raise OSError('Passed non-file path: {}'
1580 .format(path))
1581 piece = ParquetDatasetPiece._create(
OSError: Passed non-file path: oss-shared-scratch/timeseries.parquet
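The frames above show that pyarrow’s legacy _make_manifest raises as soon as fs.isfile(path) returns False for the protocol-stripped path it was handed. A quick check, sketched below with the same placeholder storage_options, is to ask the fsspec filesystem directly what it thinks that path is:

import s3fs

storage_options = {}  # placeholder; same credentials as the read_parquet call
fs = s3fs.S3FileSystem(**storage_options)

path = "oss-shared-scratch/timeseries.parquet"  # protocol already stripped by dask
print("isfile:", fs.isfile(path))  # False here is exactly what triggers the OSError
print("isdir: ", fs.isdir(path))   # True would suggest a directory-style dataset
print("exists:", fs.exists(path))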

Listing files on S3 can be slow, because each call returns only a limited number of entries (and a surprising number of bytes, due to the XML encoding and the amount of detail), and you need to paginate through them serially. Since we are explicitly moving towards a model with no _metadata file and many data files in many directories, this cost will only grow. You would notice the lack of caching particularly if you had reason to call dd.read_parquet multiple times on the same dataset, perhaps because you want to try different options. (If there were a _metadata file, there would be no need to list at all. Of course, my intuition might be out of date by now…)
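As a rough illustration of the listing cost (a sketch only; the path and timings are illustrative), repeated ls() calls on the same s3fs instance are served from its in-memory listings cache, while the first call has to page through S3:

import time
import s3fs

fs = s3fs.S3FileSystem()  # default: per-instance listings cache enabled

t0 = time.time()
fs.ls("oss-shared-scratch/timeseries.parquet")  # first call paginates through S3
print("first listing :", time.time() - t0)

t0 = time.time()
fs.ls("oss-shared-scratch/timeseries.parquet")  # served from the listings cache
print("cached listing:", time.time() - t0)

fs.invalidate_cache("oss-shared-scratch/timeseries.parquet")  # force a fresh listing next time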
Correct, you can always have stale info when files are modified in another process. If all you want to do is open a file with a known path, that’s probably always fine, so this only matters on the client, which argues that it doesn’t matter much whether you have it on the workers.
One note: sharing FS instances is good for cutting the cost of establishing connections/credentials. If the user has one that they are using to list files and find their data, with the default listings cache, and we choose to make another one without it, then it will be a separate instance and session. This point argues (weakly) towards invalidating the cache rather than disabling it.
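For reference, a small sketch of that instance-caching behaviour (the storage options shown are illustrative; skip_instance_cache and invalidate_cache are existing fsspec features):

import fsspec

fs_a = fsspec.filesystem("s3", anon=False)
fs_b = fsspec.filesystem("s3", anon=False)
assert fs_a is fs_b  # same options -> the same cached instance and session

fs_c = fsspec.filesystem("s3", anon=False, skip_instance_cache=True)
assert fs_c is not fs_a  # a separate instance, with its own connections/credentials

# Invalidating the listings cache on the shared instance keeps the session but
# forces the next listing to hit S3 again, rather than disabling caching entirely.
fs_a.invalidate_cache()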
#8994.