[SUPPORT] Hudi Delta Streamer doesn't recognize hive style date partition on S3


Describe the problem you faced

Hudi Delta Streamer doesn’t recognize hive-style date partitions (e.g. date=2022-01-01) in my dataset. I’m wondering if I’m missing some configuration or doing something wrong.

To Reproduce

Steps to reproduce the behavior:

  1. Run the following script to create sample data (a quick sanity check of the resulting directory layout is sketched after these steps):
from pyspark.sql import SparkSession
from datetime import date

data = [
    {'date': date(2022, 1, 5), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 1', 'email': 'fakename1@email.com'},
    {'date': date(2022, 1, 4), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 2', 'email': 'fakename2@email.com'},
    {'date': date(2022, 1, 3), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 3', 'email': 'fakename3@email.com'},
    {'date': date(2022, 2, 5), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 4', 'email': 'fakename4@email.com'},
    {'date': date(2022, 3, 5), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 5', 'email': 'fakename5@email.com'},
    {'date': date(2022, 5, 10), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 6', 'email': 'fakename6@email.com'},
    {'date': date(2022, 5, 1), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 7', 'email': 'fakename7@email.com'},
]

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data)
# Write the data locally with hive-style partitioning on the 'date' column,
# producing directories such as sample-data/date=2022-01-05/.
df.write.partitionBy('date').parquet('sample-data')
  2. Create an S3 bucket (e.g. hudi-issue-raw-zone in this example) with server-side encryption (e.g. SSE-S3 in this example) and upload the sample-data directory. Create a second bucket to act as the standard zone (e.g. hudi-issue-standard-zone in this example).
  3. Create an EMR cluster with EMR release 6.5.0 (Hadoop 3.2.1, Hive 3.1.2, Spark 3.1.2). In the AWS Glue Data Catalog settings section, check the options Use for Hive table metadata and Use for Spark table metadata. In this case I’m using a simple 3-node cluster (1 master and 2 core nodes) with the m5.xlarge instance type and a key pair to connect; the remaining options are left at their defaults. The region is us-west-2.
  4. Run the following Hudi Delta Streamer job:
spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
            --jars /usr/lib/spark/external/lib/spark-avro.jar \
            --master yarn \
            --deploy-mode client \
            --conf spark.sql.hive.convertMetastoreParquet=false \
            /usr/lib/hudi/hudi-utilities-bundle.jar \
            --table-type COPY_ON_WRITE \
            --source-ordering-field ts \
            --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
            --target-table sample_data_complex \
            --target-base-path s3://hudi-issue-standard-zone/sample-data-complex/ \
            --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://hudi-issue-raw-zone/sample-data/ \
            --hoodie-conf hoodie.datasource.write.recordkey.field=ts,email \
            --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
            --op UPSERT \
            --hoodie-conf hoodie.datasource.write.partitionpath.field=date \
            --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
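
(Aside, not part of the original report: this is the quick sanity check referenced in step 1. It is a minimal sketch that assumes it is run from the directory containing sample-data, and it confirms that step 1 produced hive-style date= directories before the upload in step 2.)

import os

# List the partition directories Spark created under sample-data/.
# Each distinct value of the 'date' column should appear as date=YYYY-MM-DD.
for entry in sorted(os.listdir('sample-data')):
    if entry.startswith('date='):
        print(entry)
# date=2022-01-03
# date=2022-01-04
# date=2022-01-05
# date=2022-02-05
# date=2022-03-05
# date=2022-05-01
# date=2022-05-10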

This is the output stored on S3:

aws s3 ls s3://hudi-issue-standard-zone/sample-data-complex/
                           PRE .hoodie/
                           PRE date=default/
2022-05-02 19:56:20          0 .hoodie_$folder$
2022-05-02 19:56:59          0 date=default_$folder$
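
A hedged way (not from the original issue) to see the same result from Spark instead of the S3 listing, assuming the Hudi bundle is on the classpath (older Hudi versions may require a glob path such as the base path followed by /*):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read back the table written by the ComplexKeyGenerator run and inspect the
# partition path that Hudi records for every row.
hudi_df = spark.read.format('org.apache.hudi').load('s3://hudi-issue-standard-zone/sample-data-complex/')
hudi_df.select('_hoodie_partition_path').distinct().show(truncate=False)
# Matching the listing above, the only value is 'default' rather than
# date=YYYY-MM-DD paths.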

Using the CustomKeyGenerator, which works with timestamp-based partitions:

spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
            --jars /usr/lib/spark/external/lib/spark-avro.jar \
            --master yarn \
            --deploy-mode client \
            --conf spark.sql.hive.convertMetastoreParquet=false \
            /usr/lib/hudi/hudi-utilities-bundle.jar \
            --table-type COPY_ON_WRITE \
            --source-ordering-field ts \
            --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
            --target-table sample_data_custom \
            --target-base-path s3://hudi-issue-standard-zone/sample-data-custom/ \
            --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://hudi-issue-raw-zone/sample-data/ \
            --hoodie-conf hoodie.datasource.write.recordkey.field=ts,email \
            --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
            --op UPSERT \
            --hoodie-conf hoodie.datasource.write.partitionpath.field=date:timestamp \
            --hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \
            --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.dateformat="yyyy-MM-dd" \
            --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.dateformat="yyyy-MM-dd" \
            --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator

This is the output stored on S3:

aws s3 ls s3://hudi-issue-standard-zone/sample-data-custom/
                           PRE .hoodie/
                           PRE date=1970-01-01/
2022-05-02 19:58:48          0 .hoodie_$folder$
2022-05-02 19:59:26          0 date=1970-01-01_$folder$
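
For context (a hedged aside, not part of the original report): when Spark reads the raw dataset back, the date partition column recovered from the date=YYYY-MM-DD directory names is inferred as a date type, not a string. A quick way to check the source schema:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Print the schema of the raw sample data as Spark sees it; the 'date'
# partition column is discovered from the hive-style directory names.
spark.read.parquet('s3://hudi-issue-raw-zone/sample-data/').printSchema()
# 'date' should be reported here as a date type rather than a string.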

Expected behavior

I would expect to have the following output:

aws s3 ls s3://hudi-issue-standard-zone/sample-data-complex/
                           PRE .hoodie/
                           PRE date=2022-01-03/
                           PRE date=2022-01-04/
                           PRE date=2022-01-05/
                           PRE date=2022-02-05/
                           PRE date=2022-03-05/
                           PRE date=2022-05-01/
                           PRE date=2022-05-10/

Or

aws s3 ls s3://hudi-issue-standard-zone/sample-data-custom/
                           PRE .hoodie/
                           PRE date=2022-01-03/
                           PRE date=2022-01-04/
                           PRE date=2022-01-05/
                           PRE date=2022-02-05/
                           PRE date=2022-03-05/
                           PRE date=2022-05-01/
                           PRE date=2022-05-10/
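
As a small cross-check (not from the original issue), the expected date= prefixes are simply the distinct dates in the sample records from step 1:

from datetime import date

# Distinct dates used in the sample records created in step 1.
sample_dates = {date(2022, 1, 5), date(2022, 1, 4), date(2022, 1, 3),
                date(2022, 2, 5), date(2022, 3, 5), date(2022, 5, 10), date(2022, 5, 1)}
for prefix in sorted(f"date={d.isoformat()}" for d in sample_dates):
    print(prefix)
# Prints the seven date=YYYY-MM-DD prefixes expected above.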

Environment Description

  • Hudi version : 0.9.0-amzn-1

  • Spark version : 3.1.2

  • Hive version : 3.1.2

  • Hadoop version : 3.2.1

  • Storage (HDFS/S3/GCS…) : S3

  • Running on Docker? (yes/no) : No

Stacktrace

There is no stack trace in this case, just an unexpected value.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 1
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

1 reaction
leobiscassi commented, May 4, 2022

@yihua nice, I’ll work on this and submit a PR, thanks. 👍🏽

1 reaction
yihua commented, May 3, 2022

Feel free to close the issue if all good.


