Parquet filepath parsing error
I attempted to load a large Parquet dataset on S3 with
import dask.dataframe as dd
df = dd.read_parquet("s3://oss-shared-scratch/timeseries.parquet", storage_options=storage_options)
but encountered the following error from pyarrow (full traceback below):
OSError: Passed non-file path: oss-shared-scratch/timeseries.parquet
I’m not sure whether the underlying issue is in pyarrow, dask, or fsspec. I’ll attempt to create a minimal reproducer, but thought the traceback might be useful by itself.
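Until that reproducer exists, one possible sketch (not verified) is to bypass dask and call pyarrow’s legacy ParquetDataset directly with an s3fs filesystem, since that is the code path the traceback below goes through; the empty storage_options dict is a placeholder for the real credentials used above.

import pyarrow.parquet as pq
import s3fs

storage_options = {}  # placeholder; use the same credentials as the dd.read_parquet call above
fs = s3fs.S3FileSystem(**storage_options)

# use_legacy_dataset=True selects the ParquetDataset implementation that goes
# through _make_manifest and fs.isfile(), which is where the OSError is raised.
dataset = pq.ParquetDataset(
    "oss-shared-scratch/timeseries.parquet",
    filesystem=fs,
    use_legacy_dataset=True,
)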
cc @jorisvandenbossche @rjzamora
Full traceback:
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<timed exec> in <module>
~/projects/dask/dask/dask/dataframe/io/parquet/core.py in read_parquet(path, columns, filters, categories, index, storage_options, engine, gather_statistics, split_row_groups, read_from_paths, chunksize, aggregate_files, **kwargs)
314 gather_statistics = True
315
--> 316 read_metadata_result = engine.read_metadata(
317 fs,
318 paths,
~/projects/dask/dask/dask/dataframe/io/parquet/arrow.py in read_metadata(cls, fs, paths, categories, index, gather_statistics, filters, split_row_groups, read_from_paths, chunksize, aggregate_files, **kwargs)
540 split_row_groups,
541 gather_statistics,
--> 542 ) = cls._gather_metadata(
543 paths,
544 fs,
~/projects/dask/dask/dask/dataframe/io/parquet/arrow.py in _gather_metadata(cls, paths, fs, split_row_groups, gather_statistics, filters, index, dataset_kwargs)
1786
1787 # Step 1: Create a ParquetDataset object
-> 1788 dataset, base, fns = _get_dataset_object(paths, fs, filters, dataset_kwargs)
1789 if fns == [None]:
1790 # This is a single file. No danger in gathering statistics
~/projects/dask/dask/dask/dataframe/io/parquet/arrow.py in _get_dataset_object(paths, fs, filters, dataset_kwargs)
1754 base = paths[0]
1755 fns = [None]
-> 1756 dataset = pq.ParquetDataset(paths[0], filesystem=fs, **kwargs)
1757
1758 return dataset, base, fns
~/mambaforge/envs/coiled/lib/python3.9/site-packages/pyarrow/parquet.py in __init__(self, path_or_paths, filesystem, schema, metadata, split_row_groups, validate_schema, filters, metadata_nthreads, read_dictionary, memory_map, buffer_size, partitioning, use_legacy_dataset, pre_buffer, coerce_int96_timestamp_unit)
1330 self._partitions,
1331 self.common_metadata_path,
-> 1332 self.metadata_path) = _make_manifest(
1333 path_or_paths, self._fs, metadata_nthreads=metadata_nthreads,
1334 open_file_func=partial(_open_dataset_file, self._metadata)
~/mambaforge/envs/coiled/lib/python3.9/site-packages/pyarrow/parquet.py in _make_manifest(path_or_paths, fs, pathsep, metadata_nthreads, open_file_func)
1577 for path in path_or_paths:
1578 if not fs.isfile(path):
-> 1579 raise OSError('Passed non-file path: {}'
1580 .format(path))
1581 piece = ParquetDatasetPiece._create(
OSError: Passed non-file path: oss-shared-scratch/timeseries.parquet
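The frames above show that pyarrow’s legacy _make_manifest raises as soon as fs.isfile(path) returns False for the protocol-stripped path it was handed. A quick check, sketched below with the same placeholder storage_options, is to ask the fsspec filesystem directly what it thinks that path is:

import s3fs

storage_options = {}  # placeholder; same credentials as the read_parquet call
fs = s3fs.S3FileSystem(**storage_options)

path = "oss-shared-scratch/timeseries.parquet"  # protocol already stripped by dask
print("isfile:", fs.isfile(path))  # False here is exactly what triggers the OSError
print("isdir: ", fs.isdir(path))   # True would suggest a directory-style dataset
print("exists:", fs.exists(path))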

Listing files on S3 can be slow, because each call returns only a limited number of entries (and a surprising number of bytes, due to the XML encoding and the amount of detail), and you need to paginate through them serially. Since we are explicitly moving towards a model with no _metadata file and many data files in many directories, this cost will only grow. You would notice the lack of caching particularly if you had reason to call dd.read_parquet multiple times on the same dataset, perhaps because you want to try different options. (If there were a _metadata file, there would be no need to list at all. Of course, my intuition might be out of date by now…)
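As a rough illustration of the listing cost (a sketch only; the path and timings are illustrative), repeated ls() calls on the same s3fs instance are served from its in-memory listings cache, while the first call has to page through S3:

import time
import s3fs

fs = s3fs.S3FileSystem()  # default: per-instance listings cache enabled

t0 = time.time()
fs.ls("oss-shared-scratch/timeseries.parquet")  # first call paginates through S3
print("first listing :", time.time() - t0)

t0 = time.time()
fs.ls("oss-shared-scratch/timeseries.parquet")  # served from the listings cache
print("cached listing:", time.time() - t0)

fs.invalidate_cache("oss-shared-scratch/timeseries.parquet")  # force a fresh listing next time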
Correct, you can always have stale info when files are modified in another process. If all you want to do is open a file with a known path, that’s probably always fine, so this only matters on the client, which argues that it doesn’t matter much whether you have it on the workers.
One note: sharing FS instances is good for cutting the cost of establishing connections/credentials. If the user has one that they are using to list files and find their data, with the default listings cache, and we choose to make another one without it, then it will be a separate instance and session. This point argues (weakly) towards invalidating the cache rather than disabling it.
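For reference, a small sketch of that instance-caching behaviour (the storage options shown are illustrative; skip_instance_cache and invalidate_cache are existing fsspec features):

import fsspec

fs_a = fsspec.filesystem("s3", anon=False)
fs_b = fsspec.filesystem("s3", anon=False)
assert fs_a is fs_b  # same options -> the same cached instance and session

fs_c = fsspec.filesystem("s3", anon=False, skip_instance_cache=True)
assert fs_c is not fs_a  # a separate instance, with its own connections/credentials

# Invalidating the listings cache on the shared instance keeps the session but
# forces the next listing to hit S3 again, rather than disabling caching entirely.
fs_a.invalidate_cache()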
#8994.