
Parquet filepath parsing error

See original GitHub issue

I attempted to load a large Parquet dataset on S3 with

import dask.dataframe as dd

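# storage_options is assumed to hold the S3 credentials/configuration that dask passes through to s3fs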
df = dd.read_parquet("s3://oss-shared-scratch/timeseries.parquet", storage_options=storage_options)

but encountered the following error from pyarrow (full traceback below):

OSError: Passed non-file path: oss-shared-scratch/timeseries.parquet

I’m not sure if the underlying issue is with pyarrow, dask, or fsspec. I’ll attempt to create a minimal reproducer, but thought the traceback might be useful by itself.
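For reference, a minimal sketch of what such a reproducer might look like, going straight through pyarrow’s legacy ParquetDataset the way the traceback does. The path is the one from the traceback (a private bucket), s3fs credentials are assumed to be available in the environment, and the stale-listings-cache hypothesis comes from the discussion below:

import s3fs
import pyarrow.parquet as pq

# Sketch only: mirrors what dask's arrow engine does internally. The path is
# the one from the traceback and assumes access to that bucket.
fs = s3fs.S3FileSystem()
path = "oss-shared-scratch/timeseries.parquet"

# The legacy ParquetDataset checks fs.isfile(path) for each path; if fsspec's
# directory-listings cache is stale, that check can return False and raise
# "OSError: Passed non-file path: ..."
dataset = pq.ParquetDataset(path, filesystem=fs)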

cc @jorisvandenbossche @rjzamora

Full traceback:
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<timed exec> in <module>

~/projects/dask/dask/dask/dataframe/io/parquet/core.py in read_parquet(path, columns, filters, categories, index, storage_options, engine, gather_statistics, split_row_groups, read_from_paths, chunksize, aggregate_files, **kwargs)
    314         gather_statistics = True
    315 
--> 316     read_metadata_result = engine.read_metadata(
    317         fs,
    318         paths,

~/projects/dask/dask/dask/dataframe/io/parquet/arrow.py in read_metadata(cls, fs, paths, categories, index, gather_statistics, filters, split_row_groups, read_from_paths, chunksize, aggregate_files, **kwargs)
    540             split_row_groups,
    541             gather_statistics,
--> 542         ) = cls._gather_metadata(
    543             paths,
    544             fs,

~/projects/dask/dask/dask/dataframe/io/parquet/arrow.py in _gather_metadata(cls, paths, fs, split_row_groups, gather_statistics, filters, index, dataset_kwargs)
   1786 
   1787         # Step 1: Create a ParquetDataset object
-> 1788         dataset, base, fns = _get_dataset_object(paths, fs, filters, dataset_kwargs)
   1789         if fns == [None]:
   1790             # This is a single file. No danger in gathering statistics

~/projects/dask/dask/dask/dataframe/io/parquet/arrow.py in _get_dataset_object(paths, fs, filters, dataset_kwargs)
   1754         base = paths[0]
   1755         fns = [None]
-> 1756         dataset = pq.ParquetDataset(paths[0], filesystem=fs, **kwargs)
   1757 
   1758     return dataset, base, fns

~/mambaforge/envs/coiled/lib/python3.9/site-packages/pyarrow/parquet.py in __init__(self, path_or_paths, filesystem, schema, metadata, split_row_groups, validate_schema, filters, metadata_nthreads, read_dictionary, memory_map, buffer_size, partitioning, use_legacy_dataset, pre_buffer, coerce_int96_timestamp_unit)
   1330          self._partitions,
   1331          self.common_metadata_path,
-> 1332          self.metadata_path) = _make_manifest(
   1333              path_or_paths, self._fs, metadata_nthreads=metadata_nthreads,
   1334              open_file_func=partial(_open_dataset_file, self._metadata)

~/mambaforge/envs/coiled/lib/python3.9/site-packages/pyarrow/parquet.py in _make_manifest(path_or_paths, fs, pathsep, metadata_nthreads, open_file_func)
   1577         for path in path_or_paths:
   1578             if not fs.isfile(path):
-> 1579                 raise OSError('Passed non-file path: {}'
   1580                               .format(path))
   1581             piece = ParquetDatasetPiece._create(

OSError: Passed non-file path: oss-shared-scratch/timeseries.parquet

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 13 (13 by maintainers)

Top GitHub Comments

1 reaction
martindurant commented, Apr 27, 2022

Listing files on S3 can be slow, because each call returns only a limited number of files (yet a surprising number of bytes, due to XML encoding and lots of details), and you need to paginate through them serially. Since we are explicitly moving towards a model with no _metadata and many data files in many directories, this cost will only grow in the future. You would particularly notice the absence of caching if you had reason to call dd.read_parquet multiple times on the same dataset, perhaps because you want to try different options.
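As a concrete sketch of the listings cache being described, with a hypothetical bucket and credentials assumed to be available in the environment:

import s3fs

fs = s3fs.S3FileSystem()

# The first listing paginates serially through S3 list-objects responses and
# stores the result in fsspec's directory cache (fs.dircache).
fs.ls("my-bucket/dataset.parquet")

# Listing the same prefix again is answered from that cache with no further
# S3 calls, which is what makes a repeated dd.read_parquet cheaper.
fs.ls("my-bucket/dataset.parquet")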

(If there were a _metadata, there would be no need to list at all. Of course, my intuition might be out of date by now…)
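For context on _metadata: dask can consolidate one at write time. A hedged sketch, using dask’s built-in demo dataset and a hypothetical destination bucket:

import dask

# Toy dataset from dask's demo data generator.
df = dask.datasets.timeseries()

# write_metadata_file=True asks dask to consolidate the per-file parquet
# footers into a single _metadata file, which a reader can open instead of
# listing every data file. The destination bucket below is hypothetical.
df.to_parquet("s3://my-bucket/dataset.parquet", write_metadata_file=True)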

“Call fs.invalidate_cache() in non-read operations. (wouldn’t this still not work if files were written by an external process before a read_parquet call?)”

Correct, you can always have stale info when files are modified in another process. If all you want to do is open a file with a known path, that’s probably always fine, so this only matters on the client. Which argues that it doesn’t matter much whether you have the cache on the workers.
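A minimal sketch of that workaround, again with a hypothetical bucket, showing the cache being dropped between reads:

import s3fs
import dask.dataframe as dd

fs = s3fs.S3FileSystem()
df = dd.read_parquet("s3://my-bucket/dataset.parquet")

# ... an external process adds files to the dataset here ...

# Drop the now-stale directory listings; because fsspec caches filesystem
# instances, dask's next read will see this same (freshly invalidated) fs.
fs.invalidate_cache()
df = dd.read_parquet("s3://my-bucket/dataset.parquet")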

One note: sharing FS instances is good for cutting the cost of establishing connections/credentials. If the user has one they are using to list files and find their data, with the default listings cache, and we choose to make another one without, then it will be a separate instance and session. This point argues (weakly) towards invalidating the cache rather than disabling it.
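A small sketch of the instance sharing being described, assuming fsspec’s use_listings_cache option:

import s3fs

fs1 = s3fs.S3FileSystem()
fs2 = s3fs.S3FileSystem()
assert fs1 is fs2  # identical arguments: fsspec returns the cached instance

# Different arguments (here, disabling the listings cache) yield a separate
# instance with its own session, which is the cost weighed above.
fs3 = s3fs.S3FileSystem(use_listings_cache=False)
assert fs3 is not fs1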

0 reactions
jcrist commented, Apr 28, 2022
Read more comments on GitHub >

Top Results From Across the Web

Unable to read a parquet file - Stack Overflow
I am thrown an error that reads OSError: Passed non-file path: \datasets\proj\train\train.parquet each time I try to create a df from it.
Read more >
Parquet file parsing error · Issue #119 · ironSource/parquetjs
Hi, I have created a fruits.parquet file using the sample writer.js. When I submit the file to an online parquet viewer I am getting Unknown error while ......
Read more >
Troubleshooting Reads from ORC and Parquet Files - Vertica
If a Parquet file contains a column of type STRING but the column in Vertica is of a different type, such as INTEGER,...
Read more >
FAQ: Subdirectory name in a file path should not match with ...
Error parsing the parquet file: Invalid Parquet file size is 0 bytes File '/new_partition/new/_remove' Row 0 starts at line 0, column If you ......
Read more >
Converting from Parquet to Delta Lake fails - Azure Databricks
The directory containing the Parquet file contains one or more subdirectories. The conversion fails with the error message: Expecting 0 ...
Read more >
