How to read existing Hudi-written data from S3 using the AWS Glue DynamicFrame class. Reads fail with the error: An error occurred while calling o84.getDynamicFrame. s3://xxxx/.hoodie/202212312312.commit is not a Parquet file. expected magic number at tail
Describe the problem you faced
We write data to S3 with AWS Glue jobs using the Hudi libraries. Writing works, but reading the Hudi-written data back from the same S3 prefix fails with the error above.
To Reproduce
Steps to reproduce the behavior:
- Write some data to S3 via an AWS Glue job script using the Hudi APIs and libraries.
- Try to read it back using the DynamicFrame class with the code below.
- The read fails with the error shown in the stacktrace section.
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())
# Read the Hudi table registered in the Glue Data Catalog
data = glueContext.create_dynamic_frame.from_catalog(database="ingestion_hudi_db_392d4f60", table_name="ingestion_details")
print("Count: ", data.count())
data.printSchema()
Expected behavior
We should be able to read back the Hudi-written data using the AWS Glue DynamicFrame class, just as we are able to write it. Sample code showing how the data is written to S3 in Hudi format using the DynamicFrame class:
glueContext.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(inputDf, glueContext, "inputDf"),
    connection_type="marketplace.spark",
    connection_options=combinedConf)
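The `combinedConf` dictionary is not shown in the issue. A minimal sketch of what such a connection-options dictionary might look like for the Hudi marketplace connector; the bucket, connection name, and key/precombine fields below are illustrative assumptions, not values from the issue:

```python
# Hypothetical Hudi connection options for Glue's marketplace.spark connector.
# All values (bucket, connection name, key fields) are illustrative only.
combinedConf = {
    "path": "s3://hudi-bucket-392d4f60/ingestion/",
    "connectionName": "hudi-connection",              # Glue connector name (assumed)
    "className": "org.apache.hudi",                   # Spark data source class
    "hoodie.table.name": "ingestion_details",
    "hoodie.datasource.write.recordkey.field": "id",          # assumed record key
    "hoodie.datasource.write.precombine.field": "updated_at", # assumed precombine field
    "hoodie.datasource.write.operation": "upsert",
}
```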
Environment Description
- Hudi version : 0.10.0
- Spark version : 3.1
- Storage (HDFS/S3/GCS…) : S3
- Running on Docker? (yes/no) : no
Additional context
If we find a solution to this issue it will also resolve #5880.
Stacktrace
An error occurred while calling o84.getDynamicFrame. s3://hudi-bucket-392d4f60/ingestion/.hoodie/20220616185638342.commit is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [32, 125, 10, 125]
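The byte values in the error decode to ASCII and explain the failure: Parquet files end with the 4-byte magic `PAR1`, while Hudi's `.commit` files are JSON metadata, so their last bytes are closing braces and a newline. A quick check in plain Python (no Glue required):

```python
# Parquet readers verify the 4-byte magic "PAR1" at the tail of the file.
expected = bytes([80, 65, 82, 49])
print(expected.decode("ascii"))  # PAR1

# The .hoodie/*.commit file Glue tried to read is JSON, so its tail
# bytes are " }\n}" rather than the Parquet magic.
found = bytes([32, 125, 10, 125])
print(repr(found.decode("ascii")))
```

This is why the workarounds below all exclude `.hoodie` metadata files from the read: they are not Parquet and were never meant to be read as data.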
Issue Analytics
- Created a year ago
- Comments: 7 (3 by maintainers)
Top GitHub Comments
Below are two workable solutions to get around this problem. Both involve adding additional_options to the S3 connection so that Glue excludes Hudi's metadata files when reading:
additional_options={"exclusions": "[\"s3://bucket/folder/.hoodie/*\"]"}
additional_options={"exclusions": "[\"**.commit\",\"**.inflight\",\"**.properties\"]"}
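A sketch of how the first workaround could be wired into the original read, assuming the same catalog database and table from the issue. Since the Glue runtime is not available outside a job, the from_catalog call is shown as a comment and only the option dictionary is built here:

```python
import json

# The "exclusions" value is a JSON-encoded list passed as a string, which is
# the shape Glue expects. Building it with json.dumps avoids escaping mistakes.
additional_options = {
    "exclusions": json.dumps(["s3://hudi-bucket-392d4f60/ingestion/.hoodie/*"])
}
print(additional_options["exclusions"])

# Inside the Glue job this would be passed to the original read, e.g.:
# data = glueContext.create_dynamic_frame.from_catalog(
#     database="ingestion_hudi_db_392d4f60",
#     table_name="ingestion_details",
#     additional_options=additional_options)
```

With the `.hoodie/*` metadata files excluded, only the Parquet data files under the prefix are read, so the magic-number check no longer trips on JSON commit files.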
@gtwuser
Thanks @jharringtonCoupons, closing it for now.