[Datasets] [Bug] ray.data.read_parquet fails with multiple files
Ray version: 1.9.1
pyarrow version:

```
(ray) ➜ ray-test pip show pyarrow
Name: pyarrow
Version: 4.0.1
Summary: Python library for Apache Arrow
Home-page: https://arrow.apache.org/
Author:
Author-email:
License: Apache License, Version 2.0
Location: /usr/local/Caskroom/miniconda/base/envs/ray/lib/python3.7/site-packages
Requires: numpy
Required-by:
```
```python
import ray

paths = [ … ]  # a list of S3 files

# This is okay:
ds = ray.data.read_parquet(paths[0])
ds.schema()

# This is also okay:
ds = ray.data.read_parquet(paths[1])
ds.schema()

# The schemas are exactly the same, but this is not okay:
ds = ray.data.read_parquet(paths[0:2])
```

with the error message:
```
2022-01-28 10:31:54,543 INFO services.py:1340 -- View the Ray dashboard at http://127.0.0.1:8266
Traceback (most recent call last):
  File "/home/ray/ray_data.py", line 16, in <module>
    ds = ray.data.read_parquet([paths[1], paths[2]])
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/read_api.py", line 278, in read_parquet
    **arrow_parquet_args)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/read_api.py", line 161, in read_datasource
    read_tasks = datasource.prepare_read(parallelism, **read_args)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/datasource/parquet_datasource.py", line 65, in prepare_read
    use_legacy_dataset=False)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pyarrow/parquet.py", line 1239, in __new__
    metadata_nthreads=metadata_nthreads)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pyarrow/parquet.py", line 1561, in __init__
    ignore_prefixes=ignore_prefixes)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pyarrow/dataset.py", line 659, in dataset
    return _filesystem_dataset(source, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pyarrow/dataset.py", line 411, in _filesystem_dataset
    return factory.finish(schema)
  File "pyarrow/_dataset.pyx", line 2201, in pyarrow._dataset.DatasetFactory.finish
  File "pyarrow/error.pxi", line 141, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 97, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Unable to merge: Field tag has incompatible types: int32 vs dictionary<values=int32, indices=int32, ordered=0>
```
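The last line is Arrow's schema-unification error: as diagnosed in the comments below, partitioning inference derives a dictionary-encoded `tag` field from the Hive-style file paths, while the files themselves store `tag` as a plain int32. A minimal sketch of the same conflict at the pyarrow level (assuming a pyarrow version that provides `pa.unify_schemas`; the schemas below are illustrative stand-ins, not taken from the issue):

```python
import pyarrow as pa

# `tag` as stored inside the Parquet files.
in_file = pa.schema([pa.field("tag", pa.int32())])

# `tag` as inferred from the Hive-style partition directories.
from_path = pa.schema([pa.field("tag", pa.dictionary(pa.int32(), pa.int32()))])

# Raises pyarrow.lib.ArrowInvalid: Unable to merge: Field tag has
# incompatible types: int32 vs dictionary<values=int32, indices=int32, ordered=0>
pa.unify_schemas([in_file, from_path])
```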
Top GitHub Comments
@clarkzinzow the schema is:
This was determined to be an (Arrow-level) issue with Parquet dataset partitioning when the partition field (`tag`) is included both in the file paths (via the Hive partitioning scheme) and in the files themselves. This caused a column type conflict between the path-derived partition `tag` field and the in-file `tag` field. Since Parquet datasets shouldn't duplicate the partition field in the file, and since this can be worked around by ignoring partitioning, I think that we can close this!
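A minimal sketch of both the failure mode and the suggested workaround, using a hypothetical local directory layout in place of the original S3 files (the `data/` path, column names, and values are illustrative assumptions, not taken from the issue):

```python
import os

import pyarrow as pa
import pyarrow.dataset as pads
import pyarrow.parquet as pq

# Recreate the problematic layout: `tag` lives both in the Hive-style
# directory name and inside each Parquet file.
for tag in (1, 2):
    os.makedirs(f"data/tag={tag}", exist_ok=True)
    table = pa.table({"tag": pa.array([tag, tag], pa.int32()), "x": [1.0, 2.0]})
    pq.write_table(table, f"data/tag={tag}/part-0.parquet")

# Fails with the "Unable to merge" error above: the path-derived dictionary
# `tag` conflicts with the plain int32 `tag` stored in the files.
# pads.dataset("data", format="parquet", partitioning="hive").to_table()

# Works: with partitioning inference disabled, only the in-file `tag` is read.
table = pads.dataset("data", format="parquet", partitioning=None).to_table()
```

The resulting table can then be handed to Ray Datasets, e.g. via `ray.data.from_arrow(table)`, assuming an in-memory Arrow table fits the workflow; alternatively, dropping the duplicated `tag` column from the files (or from the paths) removes the conflict at the source.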