
[Datasets] [Bug] ray.data.read_parquet failed with multiple files

See original GitHub issue

Ray version:

ray, version 1.9.1

pyarrow version:

(ray) ➜  ray-test pip show pyarrow
Name: pyarrow
Version: 4.0.1
Summary: Python library for Apache Arrow
Home-page: https://arrow.apache.org/
Author:
Author-email:
License: Apache License, Version 2.0
Location: /usr/local/Caskroom/miniconda/base/envs/ray/lib/python3.7/site-packages
Requires: numpy
Required-by:

paths = [ … ] # a list of S3 files

This is okay:

ds = ray.data.read_parquet(paths[0])
ds.schema()

This is also okay:

ds = ray.data.read_parquet(paths[1])
ds.schema()

The schemas of the two files are exactly the same, but this is not okay:

ds = ray.data.read_parquet(paths[0:2])

It fails with the following error:

2022-01-28 10:31:54,543	INFO services.py:1340 -- View the Ray dashboard at http://127.0.0.1:8266
Traceback (most recent call last):
  File "/home/ray/ray_data.py", line 16, in <module>
    ds = ray.data.read_parquet([paths[1], paths[2]])
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/read_api.py", line 278, in read_parquet
    **arrow_parquet_args)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/read_api.py", line 161, in read_datasource
    read_tasks = datasource.prepare_read(parallelism, **read_args)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/datasource/parquet_datasource.py", line 65, in prepare_read
    use_legacy_dataset=False)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pyarrow/parquet.py", line 1239, in __new__
    metadata_nthreads=metadata_nthreads)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pyarrow/parquet.py", line 1561, in __init__
    ignore_prefixes=ignore_prefixes)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pyarrow/dataset.py", line 659, in dataset
    return _filesystem_dataset(source, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pyarrow/dataset.py", line 411, in _filesystem_dataset
    return factory.finish(schema)
  File "pyarrow/_dataset.pyx", line 2201, in pyarrow._dataset.DatasetFactory.finish
  File "pyarrow/error.pxi", line 141, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 97, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Unable to merge: Field tag has incompatible types: int32 vs dictionary<values=int32, indices=int32, ordered=0>
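
To pin down which field pyarrow trips over, it can help to compare the schema each file gets on its own against the schema pyarrow builds when merging them. A minimal diagnostic sketch (not from the issue), assuming pyarrow's dataset API and hypothetical bucket/key names standing in for the redacted paths:

import pyarrow.dataset as pads
from pyarrow import fs

# Hypothetical bucket and keys standing in for the redacted `paths` list.
s3, base = fs.FileSystem.from_uri("s3://my-bucket/transactions")
keys = [base + "/tag=1/part-0.parquet", base + "/tag=2/part-0.parquet"]

for key in keys:
    # One file at a time: the schema comes straight from that file's footer.
    print(pads.dataset(key, filesystem=s3, format="parquet").schema)

# Both files with hive partitioning (what the non-legacy ParquetDataset
# applies by default): the `tag` parsed from the paths and the `tag`
# stored in the files must now unify, and that unification is what
# raises the ArrowInvalid above.
print(pads.dataset(keys, filesystem=s3, format="parquet",
                   partitioning="hive").schema)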

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
coinzerge commented, Feb 1, 2022

@clarkzinzow the schema is:

hash: string not null
  -- field metadata --
  PARQUET:field_id: '1'
block_hash: string not null
  -- field metadata --
  PARQUET:field_id: '2'
block_number: int64 not null
  -- field metadata --
  PARQUET:field_id: '3'
transaction_index: int64 not null
  -- field metadata --
  PARQUET:field_id: '4'
from_address: string not null
  -- field metadata --
  PARQUET:field_id: '5'
to_address: string not null
  -- field metadata --
  PARQUET:field_id: '6'
nonce: int64 not null
  -- field metadata --
  PARQUET:field_id: '7'
value: string not null
  -- field metadata --
  PARQUET:field_id: '8'
gas: int64 not null
  -- field metadata --
  PARQUET:field_id: '9'
gas_price: int64 not null
  -- field metadata --
  PARQUET:field_id: '10'
input: string not null
  -- field metadata --
  PARQUET:field_id: '11'
block_timestamp_seconds_utc: int64 not null
  -- field metadata --
  PARQUET:field_id: '12'
transaction_type: int64 not null
  -- field metadata --
  PARQUET:field_id: '13'
tag: int32 not null
  -- field metadata --
  PARQUET:field_id: '14'
max_fee_per_gas: int64
  -- field metadata --
  PARQUET:field_id: '15'
max_priority_fee_per_gas: int64
  -- field metadata --
  PARQUET:field_id: '16'
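
The tag: int32 not null field here is the in-file half of the conflict; the dictionary<values=int32, ...> half in the error is the partition value that pyarrow parses out of the tag=... path segments. A local sketch that should reproduce the same failure without S3 (hypothetical scratch directory, assuming pyarrow 4.x):

import os
import pyarrow as pa
import pyarrow.parquet as pq

root = "/tmp/tag_demo"  # hypothetical scratch directory
for tag in (1, 2):
    os.makedirs(f"{root}/tag={tag}", exist_ok=True)
    # The suspect layout: `tag` is an int32 column inside the file *and*
    # a key=value segment in the path.
    table = pa.table({"hash": ["a", "b"],
                      "tag": pa.array([tag, tag], type=pa.int32())})
    pq.write_table(table, f"{root}/tag={tag}/part-0.parquet")

# The non-legacy ParquetDataset applies hive partitioning by default and
# infers the path-derived `tag` as a dictionary column, so this should
# raise the same "Unable to merge" ArrowInvalid as in the traceback.
pq.ParquetDataset(root, use_legacy_dataset=False)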
0 reactions
clarkzinzow commented, Feb 16, 2022

This was determined to be an (Arrow-level) issue with Parquet dataset partitioning when including the partition field (tag) in both the file path (via the Hive partitioning scheme) and in the file itself. Namely:

Note: the partition columns in the original table will have their types converted to Arrow dictionary types (pandas categorical) on load.

This caused a column type conflict between the path-derived partition tag field and the in-file tag field. Since Parquet datasets shouldn’t duplicate the partition field in the file, and since this can be worked around by ignoring partitioning:

ds = ray.data.read_parquet([...], dataset_kwargs=dict(partitioning=None))

I think that we can close this!
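
For completeness, the same workaround expressed directly at the pyarrow level: with partition inference disabled, the only tag column left is the plain int32 one stored in the files, so the schemas should unify cleanly. A sketch reusing the hypothetical directory from the reproduction above:

import pyarrow.parquet as pq

# Disabling partition inference leaves only the in-file int32 `tag`.
dataset = pq.ParquetDataset("/tmp/tag_demo",  # hypothetical path
                            use_legacy_dataset=False,
                            partitioning=None)
print(dataset.read().schema)

Ray forwards dataset_kwargs to pyarrow.parquet.ParquetDataset, which is why passing partitioning=None through read_parquet has the same effect. The cleaner long-term fix is to stop duplicating the partition field inside the files themselves.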
