
read_parquet fails when reading parquet folder from s3

See original GitHub issue

What happened:

After creating a Dask DataFrame, I am able to successfully write Parquet to s3 using ddf.to_parquet(s3_path). The result is a folder on s3 containing an individual parquet file for each partition. When I then try to read that folder back in, dd.read_parquet(s3_path) fails with the stack trace below.

Stack Trace:

/home/hnf396/.conda/envs/seg-lvl/lib/python3.7/site-packages/pyarrow/filesystem.py:412: RuntimeWarning: coroutine 'S3FileSystem._ls' was never awaited
  for key in list(self.fs._ls(path, refresh=refresh)):
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-0ff7e6219bb0> in <module>
     12 
     13 ddf.to_parquet(s3_path, engine='pyarrow')
---> 14 ddf2 = dd.read_parquet(s3_path, engine='pyarrow')

~/.conda/envs/seg-lvl/lib/python3.7/site-packages/dask/dataframe/io/parquet/core.py in read_parquet(path, columns, filters, categories, index, storage_options, engine, gather_statistics, split_row_groups, chunksize, **kwargs)
    237         filters=filters,
    238         split_row_groups=split_row_groups,
--> 239         **kwargs,
    240     )
    241 

~/.conda/envs/seg-lvl/lib/python3.7/site-packages/dask/dataframe/io/parquet/arrow.py in read_metadata(cls, fs, paths, categories, index, gather_statistics, filters, split_row_groups, **kwargs)
    654             gather_statistics,
    655         ) = _gather_metadata(
--> 656             paths, fs, split_row_groups, gather_statistics, filters, dataset_kwargs
    657         )
    658 

~/.conda/envs/seg-lvl/lib/python3.7/site-packages/dask/dataframe/io/parquet/arrow.py in _gather_metadata(paths, fs, split_row_groups, gather_statistics, filters, dataset_kwargs)
    164 
    165     # Step 1: Create a ParquetDataset object
--> 166     dataset, base, fns = _get_dataset_object(paths, fs, filters, dataset_kwargs)
    167     if fns == [None]:
    168         # This is a single file. No danger in gathering statistics

~/.conda/envs/seg-lvl/lib/python3.7/site-packages/dask/dataframe/io/parquet/arrow.py in _get_dataset_object(paths, fs, filters, dataset_kwargs)
    142         allpaths = fs.glob(paths[0] + fs.sep + "*")
    143         base, fns = _analyze_paths(allpaths, fs)
--> 144         dataset = pq.ParquetDataset(paths[0], filesystem=fs, filters=filters, **kwargs)
    145     else:
    146         # This is a single file.  No danger in gathering statistics

~/.conda/envs/seg-lvl/lib/python3.7/site-packages/pyarrow/parquet.py in __init__(self, path_or_paths, filesystem, schema, metadata, split_row_groups, validate_schema, filters, metadata_nthreads, read_dictionary, memory_map, buffer_size, partitioning, use_legacy_dataset)
   1180          self.metadata_path) = _make_manifest(
   1181              path_or_paths, self.fs, metadata_nthreads=metadata_nthreads,
-> 1182              open_file_func=partial(_open_dataset_file, self._metadata)
   1183         )
   1184 

~/.conda/envs/seg-lvl/lib/python3.7/site-packages/pyarrow/parquet.py in _make_manifest(path_or_paths, fs, pathsep, metadata_nthreads, open_file_func)
   1358                                    open_file_func=open_file_func,
   1359                                    pathsep=getattr(fs, "pathsep", "/"),
-> 1360                                    metadata_nthreads=metadata_nthreads)
   1361         common_metadata_path = manifest.common_metadata_path
   1362         metadata_path = manifest.metadata_path

~/.conda/envs/seg-lvl/lib/python3.7/site-packages/pyarrow/parquet.py in __init__(self, dirpath, open_file_func, filesystem, pathsep, partition_scheme, metadata_nthreads)
    928         self.metadata_path = None
    929 
--> 930         self._visit_level(0, self.dirpath, [])
    931 
    932         # Due to concurrency, pieces will potentially by out of order if the

~/.conda/envs/seg-lvl/lib/python3.7/site-packages/pyarrow/parquet.py in _visit_level(self, level, base_path, part_keys)
    943         fs = self.filesystem
    944 
--> 945         _, directories, files = next(fs.walk(base_path))
    946 
    947         filtered_files = []

~/.conda/envs/seg-lvl/lib/python3.7/site-packages/pyarrow/filesystem.py in walk(self, path, refresh)
    410         files = set()
    411 
--> 412         for key in list(self.fs._ls(path, refresh=refresh)):
    413             path = key['Key']
    414             if key['StorageClass'] == 'DIRECTORY':

TypeError: 'coroutine' object is not iterable
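
The warning at the top of the trace ("coroutine 'S3FileSystem._ls' was never awaited") and the final TypeError both point at pyarrow's legacy filesystem wrapper calling s3fs's internal _ls synchronously. In newer s3fs releases that method is an async coroutine, so calling it without awaiting returns a coroutine object rather than a listing, and iterating over it fails exactly like this. A minimal illustration of that failure mode (plain Python, not Dask/pyarrow code):

# Illustration only: iterating an un-awaited coroutine reproduces
# "TypeError: 'coroutine' object is not iterable".
async def _ls(path):
    # stand-in for s3fs.S3FileSystem._ls, which is a coroutine in async s3fs
    return ['part.0.parquet', 'part.1.parquet']

listing = _ls('s3://bucket/path/tmp.parquet')  # a coroutine object, not a list
try:
    list(listing)
except TypeError as exc:
    print(exc)  # 'coroutine' object is not iterable
finally:
    listing.close()  # avoids the "was never awaited" RuntimeWarning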

What you expected to happen:

The parquet data being read back in correctly when calling dd.read_parquet(s3_path).

Minimal Complete Verifiable Example:

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({
    'a': [1, 2],
    'b': [3, 4]
})

ddf = dd.from_pandas(pdf, npartitions=1)

s3_path = 's3://bucket/path/tmp.parquet'

ddf.to_parquet(s3_path)
ddf2 = dd.read_parquet(s3_path)
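
As an aside, the failure above is not credential-related, but for completeness: both calls accept storage_options, which Dask forwards to s3fs. The key names below follow s3fs's convention, and the values are placeholders.

opts = {'key': 'AKIA...', 'secret': '...'}  # placeholder credentials for s3fs
ddf.to_parquet(s3_path, storage_options=opts)
ddf2 = dd.read_parquet(s3_path, storage_options=opts)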

Anything else we need to know?:

This issue does not happen when writing to local disk, and it does not happen when writing CSVs to s3 (with everything else kept the same). It also does not happen if s3_path points to a single parquet file (e.g. the output of pdf.to_parquet(s3_path) or ddf.compute().to_parquet(s3_path)) rather than to a folder containing one parquet file per partition (e.g. the output of ddf.to_parquet(s3_path)).
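
Two workarounds commonly suggested for this s3fs/pyarrow mismatch at the time, offered here as sketches rather than a fix confirmed in this thread: read with the fastparquet engine, or pin s3fs below 0.5 (the release that made its listing methods async coroutines).

# Option 1: sidestep pyarrow's legacy filesystem wrapper (requires fastparquet)
ddf2 = dd.read_parquet(s3_path, engine='fastparquet')

# Option 2: pin s3fs to a pre-async release, e.g.
#   pip install "s3fs<0.5"
# so the synchronous _ls call in pyarrow's wrapper works again.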

Environment:

  • Dask version: 2.30.0
  • Python version: 3.7.9
  • Operating System: Linux
  • Install method (conda, pip, source): pip
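
The pyarrow and s3fs versions are not listed, and the failing call spans both libraries, so a quick version check (assuming the usual __version__ attributes) helps pin down which combination is installed:

import dask, fsspec, pyarrow, s3fs

for mod in (dask, fsspec, pyarrow, s3fs):
    print(mod.__name__, mod.__version__)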

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
gforsyth commented, Nov 12, 2020

Ok, @justinessert , I’m going to close this out, then – pyarrow 2.0 support is on the way and is being tracked in a separate issue.

0 reactions
justinessert commented, Nov 12, 2020

Can confirm that this works for me too. Thank you!

