
[Datasets] [Bug] ray.data.read_parquet failed with multiple files

See original GitHub issue

Ray version:

ray, version 1.9.1

pyarrow version:

(ray) ➜  ray-test pip show pyarrow
Name: pyarrow
Version: 4.0.1
Summary: Python library for Apache Arrow
Home-page: https://arrow.apache.org/
Author:
Author-email:
License: Apache License, Version 2.0
Location: /usr/local/Caskroom/miniconda/base/envs/ray/lib/python3.7/site-packages
Requires: numpy
Required-by:

paths = [ … ] # a list of S3 files

This is okay:

ds = ray.data.read_parquet(paths[0])
ds.schema()

This is also okay:

ds = ray.data.read_parquet(paths[1])
ds.schema()

The schemas of the two files are exactly the same, but this is not okay:

ds = ray.data.read_parquet(paths[0:2])

It fails with the following error:

2022-01-28 10:31:54,543	INFO services.py:1340 -- View the Ray dashboard at http://127.0.0.1:8266
Traceback (most recent call last):
  File "/home/ray/ray_data.py", line 16, in <module>
    ds = ray.data.read_parquet([paths[1], paths[2]])
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/read_api.py", line 278, in read_parquet
    **arrow_parquet_args)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/read_api.py", line 161, in read_datasource
    read_tasks = datasource.prepare_read(parallelism, **read_args)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/datasource/parquet_datasource.py", line 65, in prepare_read
    use_legacy_dataset=False)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pyarrow/parquet.py", line 1239, in __new__
    metadata_nthreads=metadata_nthreads)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pyarrow/parquet.py", line 1561, in __init__
    ignore_prefixes=ignore_prefixes)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pyarrow/dataset.py", line 659, in dataset
    return _filesystem_dataset(source, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pyarrow/dataset.py", line 411, in _filesystem_dataset
    return factory.finish(schema)
  File "pyarrow/_dataset.pyx", line 2201, in pyarrow._dataset.DatasetFactory.finish
  File "pyarrow/error.pxi", line 141, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 97, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Unable to merge: Field tag has incompatible types: int32 vs dictionary<values=int32, indices=int32, ordered=0>
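
To pin down which field pyarrow trips over, it can help to compare the schema each file gets on its own against the schema pyarrow builds when merging them. A minimal diagnostic sketch (not from the issue), assuming pyarrow's dataset API and hypothetical bucket/key names standing in for the redacted paths:

import pyarrow.dataset as pads
from pyarrow import fs

# Hypothetical bucket and keys standing in for the redacted `paths` list.
s3, base = fs.FileSystem.from_uri("s3://my-bucket/transactions")
keys = [base + "/tag=1/part-0.parquet", base + "/tag=2/part-0.parquet"]

for key in keys:
    # One file at a time: the schema comes straight from that file's footer.
    print(pads.dataset(key, filesystem=s3, format="parquet").schema)

# Both files with hive partitioning (what the non-legacy ParquetDataset
# applies by default): the `tag` parsed from the paths and the `tag`
# stored in the files must now unify, and that unification is what
# raises the ArrowInvalid above.
print(pads.dataset(keys, filesystem=s3, format="parquet",
                   partitioning="hive").schema)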

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
coinzerge commented, Feb 1, 2022

@clarkzinzow the schema is:

hash: string not null
  -- field metadata --
  PARQUET:field_id: '1'
block_hash: string not null
  -- field metadata --
  PARQUET:field_id: '2'
block_number: int64 not null
  -- field metadata --
  PARQUET:field_id: '3'
transaction_index: int64 not null
  -- field metadata --
  PARQUET:field_id: '4'
from_address: string not null
  -- field metadata --
  PARQUET:field_id: '5'
to_address: string not null
  -- field metadata --
  PARQUET:field_id: '6'
nonce: int64 not null
  -- field metadata --
  PARQUET:field_id: '7'
value: string not null
  -- field metadata --
  PARQUET:field_id: '8'
gas: int64 not null
  -- field metadata --
  PARQUET:field_id: '9'
gas_price: int64 not null
  -- field metadata --
  PARQUET:field_id: '10'
input: string not null
  -- field metadata --
  PARQUET:field_id: '11'
block_timestamp_seconds_utc: int64 not null
  -- field metadata --
  PARQUET:field_id: '12'
transaction_type: int64 not null
  -- field metadata --
  PARQUET:field_id: '13'
tag: int32 not null
  -- field metadata --
  PARQUET:field_id: '14'
max_fee_per_gas: int64
  -- field metadata --
  PARQUET:field_id: '15'
max_priority_fee_per_gas: int64
  -- field metadata --
  PARQUET:field_id: '16'
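
The tag: int32 not null field here is the in-file half of the conflict; the dictionary<values=int32, ...> half in the error is the partition value that pyarrow parses out of the tag=... path segments. A local sketch that should reproduce the same failure without S3 (hypothetical scratch directory, assuming pyarrow 4.x):

import os
import pyarrow as pa
import pyarrow.parquet as pq

root = "/tmp/tag_demo"  # hypothetical scratch directory
for tag in (1, 2):
    os.makedirs(f"{root}/tag={tag}", exist_ok=True)
    # The suspect layout: `tag` is an int32 column inside the file *and*
    # a key=value segment in the path.
    table = pa.table({"hash": ["a", "b"],
                      "tag": pa.array([tag, tag], type=pa.int32())})
    pq.write_table(table, f"{root}/tag={tag}/part-0.parquet")

# The non-legacy ParquetDataset applies hive partitioning by default and
# infers the path-derived `tag` as a dictionary column, so this should
# raise the same "Unable to merge" ArrowInvalid as in the traceback.
pq.ParquetDataset(root, use_legacy_dataset=False)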
0 reactions
clarkzinzow commented, Feb 16, 2022

This was determined to be an (Arrow-level) issue with Parquet dataset partitioning when including the partition field (tag) in both the file path (via the Hive partitioning scheme) and in the file itself. Namely:

Note: the partition columns in the original table will have their types converted to Arrow dictionary types (pandas categorical) on load.

This caused a column type conflict between the path-derived partition tag field and the in-file tag field. Since Parquet datasets shouldn’t duplicate the partition field in the file, and since this can be worked around by ignoring partitioning:

ds = ray.data.read_parquet([...], dataset_kwargs=dict(partitioning=None))

I think that we can close this!
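
For completeness, the same workaround expressed directly at the pyarrow level: with partition inference disabled, the only tag column left is the plain int32 one stored in the files, so the schemas should unify cleanly. A sketch reusing the hypothetical directory from the reproduction above:

import pyarrow.parquet as pq

# Disabling partition inference leaves only the in-file int32 `tag`.
dataset = pq.ParquetDataset("/tmp/tag_demo",  # hypothetical path
                            use_legacy_dataset=False,
                            partitioning=None)
print(dataset.read().schema)

Ray forwards dataset_kwargs to pyarrow.parquet.ParquetDataset, which is why passing partitioning=None through read_parquet has the same effect. The cleaner long-term fix is to stop duplicating the partition field inside the files themselves.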
