[SUPPORT] Hudi Delta Streamer doesn't recognize hive style date partition on S3


Describe the problem you faced

Hudi Delta Streamer doesn’t recognize hive-style date partitions (e.g. date=2022-01-01) in my dataset. I’m wondering if I’m missing some configuration or doing something wrong.

To Reproduce

Steps to reproduce the behavior:

  1. Run the following script to create sample data (a quick sanity check of the resulting directory layout is sketched after these steps):
from pyspark.sql import SparkSession
from datetime import date

data = [
    {'date': date(2022, 1, 5), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 1', 'email': 'fakename1@email.com'},
    {'date': date(2022, 1, 4), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 2', 'email': 'fakename2@email.com'},
    {'date': date(2022, 1, 3), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 3', 'email': 'fakename3@email.com'},
    {'date': date(2022, 2, 5), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 4', 'email': 'fakename4@email.com'},
    {'date': date(2022, 3, 5), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 5', 'email': 'fakename5@email.com'},
    {'date': date(2022, 5, 10), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 6', 'email': 'fakename6@email.com'},
    {'date': date(2022, 5, 1), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 7', 'email': 'fakename7@email.com'},
]

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data)
# Write the data locally with hive-style partitioning on the 'date' column,
# producing directories such as sample-data/date=2022-01-05/.
df.write.partitionBy('date').parquet('sample-data')
  2. Create an S3 bucket (e.g. hudi-issue-raw-zone in this example) with server-side encryption (e.g. SSE-S3 in this example) and upload the sample-data directory. Create a second bucket to act as the standard zone (e.g. hudi-issue-standard-zone in this example).
  3. Create an EMR cluster with EMR release 6.5.0 (Hadoop 3.2.1, Hive 3.1.2, Spark 3.1.2). In the AWS Glue Data Catalog settings section, check the options Use for Hive table metadata and Use for Spark table metadata. In this case I’m using a simple 3-node cluster (1 master and 2 core nodes) with the m5.xlarge instance type and a key pair to connect; the remaining options are left at their defaults. The region is us-west-2.
  4. Run the following Hudi Delta Streamer job:
spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
            --jars /usr/lib/spark/external/lib/spark-avro.jar \
            --master yarn \
            --deploy-mode client \
            --conf spark.sql.hive.convertMetastoreParquet=false \
            /usr/lib/hudi/hudi-utilities-bundle.jar \
            --table-type COPY_ON_WRITE \
            --source-ordering-field ts \
            --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
            --target-table sample_data_complex \
            --target-base-path s3://hudi-issue-standard-zone/sample-data-complex/ \
            --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://hudi-issue-raw-zone/sample-data/ \
            --hoodie-conf hoodie.datasource.write.recordkey.field=ts,email \
            --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
            --op UPSERT \
            --hoodie-conf hoodie.datasource.write.partitionpath.field=date \
            --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
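
(Aside, not part of the original report: this is the quick sanity check referenced in step 1. It is a minimal sketch that assumes it is run from the directory containing sample-data, and it confirms that step 1 produced hive-style date= directories before the upload in step 2.)

import os

# List the partition directories Spark created under sample-data/.
# Each distinct value of the 'date' column should appear as date=YYYY-MM-DD.
for entry in sorted(os.listdir('sample-data')):
    if entry.startswith('date='):
        print(entry)
# date=2022-01-03
# date=2022-01-04
# date=2022-01-05
# date=2022-02-05
# date=2022-03-05
# date=2022-05-01
# date=2022-05-10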

This is the output stored on S3:

aws s3 ls s3://hudi-issue-standard-zone/sample-data-complex/
                           PRE .hoodie/
                           PRE date=default/
2022-05-02 19:56:20          0 .hoodie_$folder$
2022-05-02 19:56:59          0 date=default_$folder$
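
A hedged way (not from the original issue) to see the same result from Spark instead of the S3 listing, assuming the Hudi bundle is on the classpath (older Hudi versions may require a glob path such as the base path followed by /*):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read back the table written by the ComplexKeyGenerator run and inspect the
# partition path that Hudi records for every row.
hudi_df = spark.read.format('org.apache.hudi').load('s3://hudi-issue-standard-zone/sample-data-complex/')
hudi_df.select('_hoodie_partition_path').distinct().show(truncate=False)
# Matching the listing above, the only value is 'default' rather than
# date=YYYY-MM-DD paths.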

Using the CustomKeyGenerator, which works with timestamp-based partitions:

spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
            --jars /usr/lib/spark/external/lib/spark-avro.jar \
            --master yarn \
            --deploy-mode client \
            --conf spark.sql.hive.convertMetastoreParquet=false \
            /usr/lib/hudi/hudi-utilities-bundle.jar \
            --table-type COPY_ON_WRITE \
            --source-ordering-field ts \
            --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
            --target-table sample_data_custom \
            --target-base-path s3://hudi-issue-standard-zone/sample-data-custom/ \
            --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://hudi-issue-raw-zone/sample-data/ \
            --hoodie-conf hoodie.datasource.write.recordkey.field=ts,email \
            --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
            --op UPSERT \
            --hoodie-conf hoodie.datasource.write.partitionpath.field=date:timestamp \
            --hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \
            --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.dateformat="yyyy-MM-dd" \
            --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.dateformat="yyyy-MM-dd" \
            --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator

This is the output stored on S3:

aws s3 ls s3://hudi-issue-standard-zone/sample-data-custom/
                           PRE .hoodie/
                           PRE date=1970-01-01/
2022-05-02 19:58:48          0 .hoodie_$folder$
2022-05-02 19:59:26          0 date=1970-01-01_$folder$
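
For context (a hedged aside, not part of the original report): when Spark reads the raw dataset back, the date partition column recovered from the date=YYYY-MM-DD directory names is inferred as a date type, not a string. A quick way to check the source schema:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Print the schema of the raw sample data as Spark sees it; the 'date'
# partition column is discovered from the hive-style directory names.
spark.read.parquet('s3://hudi-issue-raw-zone/sample-data/').printSchema()
# 'date' should be reported here as a date type rather than a string.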

Expected behavior

I would expect to have the following output:

aws s3 ls s3://hudi-issue-standard-zone/sample-data-complex/
                           PRE .hoodie/
                           PRE date=2022-01-03/
                           PRE date=2022-01-04/
                           PRE date=2022-01-05/
                           PRE date=2022-02-05/
                           PRE date=2022-03-05/
                           PRE date=2022-05-01/
                           PRE date=2022-05-10/

Or

aws s3 ls s3://hudi-issue-standard-zone/sample-data-custom/
                           PRE .hoodie/
                           PRE date=2022-01-03/
                           PRE date=2022-01-04/
                           PRE date=2022-01-05/
                           PRE date=2022-02-05/
                           PRE date=2022-03-05/
                           PRE date=2022-05-01/
                           PRE date=2022-05-10/
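
As a small cross-check (not from the original issue), the expected date= prefixes are simply the distinct dates in the sample records from step 1:

from datetime import date

# Distinct dates used in the sample records created in step 1.
sample_dates = {date(2022, 1, 5), date(2022, 1, 4), date(2022, 1, 3),
                date(2022, 2, 5), date(2022, 3, 5), date(2022, 5, 10), date(2022, 5, 1)}
for prefix in sorted(f"date={d.isoformat()}" for d in sample_dates):
    print(prefix)
# Prints the seven date=YYYY-MM-DD prefixes expected above.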

Environment Description

  • Hudi version : 0.9.0-amzn-1

  • Spark version : 3.1.2

  • Hive version : 3.1.2

  • Hadoop version : 3.2.1

  • Storage (HDFS/S3/GCS…) : S3

  • Running on Docker? (yes/no) : No

Stacktrace

There is no stack trace in this case, just an unexpected value.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 1
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

1 reaction
leobiscassi commented, May 4, 2022

@yihua nice, I’ll work on this and submit a PR, thanks. 👍🏽

1 reaction
yihua commented, May 3, 2022

Feel free to close the issue if all good.


