read_parquet fails when reading parquet folder from s3
What happened:
After creating a Dask DataFrame, I am able to successfully write it to S3 as Parquet using ddf.to_parquet(s3_path). The result is a folder on S3 containing an individual Parquet file for each partition. However, trying to read that data back with dd.read_parquet(s3_path) fails with the stack trace below.
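For reference, the folder layout produced by the write step can be inspected with s3fs directly. This is only a minimal sketch (the bucket/prefix is a placeholder, and it assumes s3fs is installed with credentials configured):

import s3fs

# Placeholder path used only for illustration; replace with a real bucket/prefix.
s3_path = 's3://bucket/path/tmp.parquet'

fs = s3fs.S3FileSystem()
# The "file" written by ddf.to_parquet is actually a directory containing
# one Parquet part file per partition.
for entry in fs.ls(s3_path):
    print(entry)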
Stack Trace:
/home/hnf396/.conda/envs/seg-lvl/lib/python3.7/site-packages/pyarrow/filesystem.py:412: RuntimeWarning: coroutine 'S3FileSystem._ls' was never awaited
for key in list(self.fs._ls(path, refresh=refresh)):
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-4-0ff7e6219bb0> in <module>
12
13 ddf.to_parquet(s3_path, engine='pyarrow')
---> 14 ddf2 = dd.read_parquet(s3_path, engine='pyarrow')
~/.conda/envs/seg-lvl/lib/python3.7/site-packages/dask/dataframe/io/parquet/core.py in read_parquet(path, columns, filters, categories, index, storage_options, engine, gather_statistics, split_row_groups, chunksize, **kwargs)
237 filters=filters,
238 split_row_groups=split_row_groups,
--> 239 **kwargs,
240 )
241
~/.conda/envs/seg-lvl/lib/python3.7/site-packages/dask/dataframe/io/parquet/arrow.py in read_metadata(cls, fs, paths, categories, index, gather_statistics, filters, split_row_groups, **kwargs)
654 gather_statistics,
655 ) = _gather_metadata(
--> 656 paths, fs, split_row_groups, gather_statistics, filters, dataset_kwargs
657 )
658
~/.conda/envs/seg-lvl/lib/python3.7/site-packages/dask/dataframe/io/parquet/arrow.py in _gather_metadata(paths, fs, split_row_groups, gather_statistics, filters, dataset_kwargs)
164
165 # Step 1: Create a ParquetDataset object
--> 166 dataset, base, fns = _get_dataset_object(paths, fs, filters, dataset_kwargs)
167 if fns == [None]:
168 # This is a single file. No danger in gathering statistics
~/.conda/envs/seg-lvl/lib/python3.7/site-packages/dask/dataframe/io/parquet/arrow.py in _get_dataset_object(paths, fs, filters, dataset_kwargs)
142 allpaths = fs.glob(paths[0] + fs.sep + "*")
143 base, fns = _analyze_paths(allpaths, fs)
--> 144 dataset = pq.ParquetDataset(paths[0], filesystem=fs, filters=filters, **kwargs)
145 else:
146 # This is a single file. No danger in gathering statistics
~/.conda/envs/seg-lvl/lib/python3.7/site-packages/pyarrow/parquet.py in __init__(self, path_or_paths, filesystem, schema, metadata, split_row_groups, validate_schema, filters, metadata_nthreads, read_dictionary, memory_map, buffer_size, partitioning, use_legacy_dataset)
1180 self.metadata_path) = _make_manifest(
1181 path_or_paths, self.fs, metadata_nthreads=metadata_nthreads,
-> 1182 open_file_func=partial(_open_dataset_file, self._metadata)
1183 )
1184
~/.conda/envs/seg-lvl/lib/python3.7/site-packages/pyarrow/parquet.py in _make_manifest(path_or_paths, fs, pathsep, metadata_nthreads, open_file_func)
1358 open_file_func=open_file_func,
1359 pathsep=getattr(fs, "pathsep", "/"),
-> 1360 metadata_nthreads=metadata_nthreads)
1361 common_metadata_path = manifest.common_metadata_path
1362 metadata_path = manifest.metadata_path
~/.conda/envs/seg-lvl/lib/python3.7/site-packages/pyarrow/parquet.py in __init__(self, dirpath, open_file_func, filesystem, pathsep, partition_scheme, metadata_nthreads)
928 self.metadata_path = None
929
--> 930 self._visit_level(0, self.dirpath, [])
931
932 # Due to concurrency, pieces will potentially by out of order if the
~/.conda/envs/seg-lvl/lib/python3.7/site-packages/pyarrow/parquet.py in _visit_level(self, level, base_path, part_keys)
943 fs = self.filesystem
944
--> 945 _, directories, files = next(fs.walk(base_path))
946
947 filtered_files = []
~/.conda/envs/seg-lvl/lib/python3.7/site-packages/pyarrow/filesystem.py in walk(self, path, refresh)
410 files = set()
411
--> 412 for key in list(self.fs._ls(path, refresh=refresh)):
413 path = key['Key']
414 if key['StorageClass'] == 'DIRECTORY':
TypeError: 'coroutine' object is not iterable
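The final TypeError is simply what Python raises when something iterates over a coroutine object instead of awaiting it, which is what pyarrow's legacy filesystem wrapper appears to be doing with the async s3fs listing call. A self-contained illustration of that failure mode (not s3fs code, just the same mechanism):

import asyncio

async def fake_ls(path):
    # Stand-in for an async filesystem listing; only useful once awaited.
    return ['part.0.parquet']

coro = fake_ls('s3://bucket/path/tmp.parquet')
try:
    list(coro)  # iterating the coroutine object instead of awaiting it
except TypeError as exc:
    print(exc)  # 'coroutine' object is not iterable
# Await it properly so Python does not warn that it was never awaited.
asyncio.run(coro)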
What you expected to happen:
The Parquet data to be read back in correctly when calling dd.read_parquet(s3_path).
Minimal Complete Verifiable Example:
import pandas as pd
import dask.dataframe as dd
pdf = pd.DataFrame({
'a': [1, 2],
'b': [3, 4]
})
ddf = dd.from_pandas(pdf, npartitions=1)
s3_path = 's3://bucket/path/tmp.parquet'
ddf.to_parquet(s3_path)
ddf2 = dd.read_parquet(s3_path)
Anything else we need to know?:
This issue does not happen when writing to local disk, and it does not happen when writing CSVs to S3 (with everything else kept the same). It also does not happen if s3_path points to a single Parquet file (e.g. the output of pdf.to_parquet(s3_path) or ddf.compute().to_parquet(s3_path)) rather than to a folder containing one Parquet file per partition (e.g. the output of ddf.to_parquet(s3_path)). A possible workaround sketch follows.
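In the meantime, one possible workaround is to read the dataset with the fastparquet engine, which goes through fsspec/s3fs directly rather than pyarrow's legacy ParquetDataset code path. This is only a sketch under the assumption that fastparquet is installed, not a confirmed fix:

import dask.dataframe as dd

# Placeholder path; replace with the real bucket/prefix.
s3_path = 's3://bucket/path/tmp.parquet'

# fastparquet does not use pyarrow's legacy filesystem wrapper, so it may
# sidestep the coroutine error shown in the traceback above.
ddf2 = dd.read_parquet(s3_path, engine='fastparquet')
print(ddf2.head())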
Environment:
- Dask version: 2.30.0
- Python version: 3.7.9
- Operating System: Linux
- Install method (conda, pip, source): pip
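Since the traceback points at an interaction between pyarrow's legacy filesystem wrapper and an async s3fs, the installed pyarrow and s3fs versions are probably the most relevant details not listed above. A quick way to capture them (assuming the packages are importable):

import dask
import fsspec
import pyarrow
import s3fs

# Print the versions relevant to this incompatibility.
for mod in (dask, pyarrow, s3fs, fsspec):
    print(mod.__name__, mod.__version__)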
Comments:
Ok, @justinessert, I'm going to close this out, then; pyarrow 2.0 support is on the way and is being tracked in a separate issue.
Can confirm that this works for me too. Thank you!