
How to read existing Hudi-written data from `S3` using the `AWS Glue DynamicFrame` class? Failing with the error below: An error occurred while calling o84.getDynamicFrame. s3://xxxx/.hoodie/202212312312.commit is not a Parquet file. expected magic number at tail


Describe the problem you faced

An error occurs while reading data written by Hudi to an S3 prefix. We write data to S3 using AWS Glue with the Hudi libraries. The issue is that when we try to read the already-written Hudi data back from this S3 prefix, it errors out.

To Reproduce

Steps to reproduce the behavior:

  1. Write some data to S3 via an AWS Glue job script using the Hudi APIs and libraries.
  2. Try to read it back using the DynamicFrame class with the code below.
  3. The read fails with the error shown in the stacktrace.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the Hudi-written table back through the Glue Data Catalog
data = glueContext.create_dynamic_frame.from_catalog(
    database="ingestion_hudi_db_392d4f60",
    table_name="ingestion_details",
)
print("Count: ", data.count())
data.printSchema()

Expected behavior

We should be able to read back the Hudi-written data using the AWS Glue DynamicFrame class, just as we are able to write it. Sample code showing how the data is written to S3 in Hudi format using the DynamicFrame class:

glueContext.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(inputDf, glueContext, "inputDf"),
    connection_type="marketplace.spark",
    connection_options=combinedConf,
)

Environment Description

  • Hudi version : 0.10.0
  • Spark version : 3.1
  • Storage (HDFS/S3/GCS…) : S3
  • Running on Docker? (yes/no) : no

Additional context

If we find a solution to this issue, it will also resolve #5880.

Stacktrace

An error occurred while calling o84.getDynamicFrame. s3://hudi-bucket-392d4f60/ingestion/.hoodie/20220616185638342.commit is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [32, 125, 10, 125]
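The byte sequences in the message explain the failure: a valid Parquet file must end with the four-byte magic number PAR1, while Hudi's `.hoodie/*.commit` files are JSON, so their last bytes are the tail of a JSON document. A quick local check in plain Python (no Glue needed) decodes the two sequences from the error:

```python
# Parquet readers verify a 4-byte "magic number" at the tail of every file.
expected = bytes([80, 65, 82, 49])  # what a Parquet file must end with
found = bytes([32, 125, 10, 125])   # what the .commit file actually ends with

print(expected)  # b'PAR1' -- the Parquet magic number
print(found)     # b' }\n}' -- closing braces of a JSON document
```

In other words, the reader is treating Hudi's JSON timeline metadata under `.hoodie/` as data files, which is why excluding that folder (as suggested in the comments) resolves the error.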

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

jharringtonCoupons commented, Sep 23, 2022 (1 reaction)

Below are two workable solutions to get around this problem. Both involve adding additional_options with an exclusions list to the S3 connection type, so that Glue skips Hudi's metadata files.

  • Exclude folder:

additional_options={"exclusions": "[\"s3://bucket/folder/.hoodie/*\"]"}

  • Exclude file(s):

additional_options={"exclusions": "[\"**.commit\",\"**.inflight\",\"**.properties\"]"}
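To see which files each glob would hit, the patterns can be sanity-checked locally with Python's fnmatch. This is only an approximation of Glue's glob matching (in fnmatch, `*` and `**` both cross directory boundaries); the bucket path below is taken from the stacktrace in this issue:

```python
from fnmatch import fnmatch

# Paths modeled on the stacktrace; the data-file name is hypothetical
commit_file = "s3://hudi-bucket-392d4f60/ingestion/.hoodie/20220616185638342.commit"
data_file = "s3://hudi-bucket-392d4f60/ingestion/part-00000.parquet"

# Folder-style exclusion (first option): skips everything under .hoodie/
assert fnmatch(commit_file, "s3://hudi-bucket-392d4f60/ingestion/.hoodie/*")

# Suffix-style exclusions (second option): skip metadata, keep data
assert fnmatch(commit_file, "**.commit")    # metadata file: excluded
assert not fnmatch(data_file, "**.commit")  # Parquet data file: still read
```

Either way, the effect is the same: the reader no longer tries to parse Hudi's JSON timeline files as Parquet.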

@gtwuser

glueContext.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(inputDf, glueContext, "inputDf"),
    connection_type="marketplace.spark",
    connection_options=combinedConf,
    additional_options={"exclusions": "[\"**.commit\",\"**.inflight\",\"**.properties\"]"},
)
gtwuser commented, Oct 1, 2022 (0 reactions)

Thanks @jharringtonCoupons, closing it for now.
