[SUPPORT] Hudi Delta Streamer doesn't recognize hive style date partition on S3
**Describe the problem you faced**

Hudi Delta Streamer doesn't recognize hive-style date partitions (e.g. `date=2022-01-01`) on my dataset. I'm wondering if I'm missing some configuration or if I'm doing something wrong.
**To Reproduce**

Steps to reproduce the behavior:
- Run the following script to create sample data:

```python
from pyspark.sql import SparkSession
from datetime import date

data = [
    {'date': date(2022, 1, 5), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 1', 'email': 'fakename1@email.com'},
    {'date': date(2022, 1, 4), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 2', 'email': 'fakename2@email.com'},
    {'date': date(2022, 1, 3), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 3', 'email': 'fakename3@email.com'},
    {'date': date(2022, 2, 5), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 4', 'email': 'fakename4@email.com'},
    {'date': date(2022, 3, 5), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 5', 'email': 'fakename5@email.com'},
    {'date': date(2022, 5, 10), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 6', 'email': 'fakename6@email.com'},
    {'date': date(2022, 5, 1), 'ts': '2022-04-10T09:47:54+00:00', 'name': 'Fake Name 7', 'email': 'fakename7@email.com'},
]

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data)
df.write.partitionBy('date').parquet('sample-data')
```
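For reference, assuming the script above runs as-is, `partitionBy('date')` produces one hive-style directory per distinct date. Note that Spark moves a partition column out of the data files, so the `date` value lives only in the directory names, not inside the parquet files themselves:

```
sample-data/
├── date=2022-01-03/
├── date=2022-01-04/
├── date=2022-01-05/
├── date=2022-02-05/
├── date=2022-03-05/
├── date=2022-05-01/
└── date=2022-05-10/
```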
- Create an S3 bucket (e.g. `hudi-issue-raw-zone` in this example) with server-side encryption (e.g. `SSE-S3` in this example) and upload the `sample-data` directory. Create a second bucket to simulate the standard zone (e.g. `hudi-issue-standard-zone` in this example).
- Create an EMR cluster with EMR release 6.5.0 (`hadoop 3.2.1`, `hive 3.1.2`, `spark 3.1.2`); in the **AWS Glue Data Catalog settings** section, mark the options **Use for Hive table metadata** and **Use for Spark table metadata**. In this case I'm selecting a simple 3-node cluster (1 master and 2 core nodes) with the instance type `m5.xlarge`, using a key pair to connect; the rest of the options are left at their default values. The region is `us-west-2`.
- Run the following Hudi Delta Streamer job:
```sh
spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --jars /usr/lib/spark/external/lib/spark-avro.jar \
  --master yarn \
  --deploy-mode client \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  /usr/lib/hudi/hudi-utilities-bundle.jar \
  --table-type COPY_ON_WRITE \
  --source-ordering-field ts \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-table sample_data_complex \
  --target-base-path s3://hudi-issue-standard-zone/sample-data-complex/ \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://hudi-issue-raw-zone/sample-data/ \
  --hoodie-conf hoodie.datasource.write.recordkey.field=ts,email \
  --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
  --op UPSERT \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=date \
  --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
```
This is the output stored on S3:

```
$ aws s3 ls s3://hudi-issue-standard-zone/sample-data-complex/
                           PRE .hoodie/
                           PRE date=default/
2022-05-02 19:56:20          0 .hoodie_$folder$
2022-05-02 19:56:59          0 date=default_$folder$
```
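A plausible explanation for the `date=default` partition (my assumption, not confirmed in this issue): `ParquetDFSSource` lists and reads the leaf parquet files directly rather than the dataset root, so Spark's hive-style partition discovery never runs, the `date` column never reaches the key generator, and Hudi falls back to its default partition name when the partition-path field resolves to null. The sketch below, runnable against the local `sample-data` directory from the first step, shows the underlying Spark behavior:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reading the dataset root triggers hive-style partition discovery,
# so `date` appears in the schema as a partition column.
spark.read.parquet('sample-data').printSchema()

# Reading a leaf directory (or individual files) directly skips
# partition discovery, so `date` is absent from the schema entirely.
spark.read.parquet('sample-data/date=2022-01-05').printSchema()
```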
Using the `CustomKeyGenerator`, which works with timestamp-based partitions:
```sh
spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --jars /usr/lib/spark/external/lib/spark-avro.jar \
  --master yarn \
  --deploy-mode client \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  /usr/lib/hudi/hudi-utilities-bundle.jar \
  --table-type COPY_ON_WRITE \
  --source-ordering-field ts \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-table sample_data_custom \
  --target-base-path s3://hudi-issue-standard-zone/sample-data-custom/ \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://hudi-issue-raw-zone/sample-data/ \
  --hoodie-conf hoodie.datasource.write.recordkey.field=ts,email \
  --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
  --op UPSERT \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=date:timestamp \
  --hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \
  --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.dateformat="yyyy-MM-dd" \
  --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.dateformat="yyyy-MM-dd" \
  --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
```
This is the output stored on S3:

```
$ aws s3 ls s3://hudi-issue-standard-zone/sample-data-custom/
                           PRE .hoodie/
                           PRE date=1970-01-01/
2022-05-02 19:58:48          0 .hoodie_$folder$
2022-05-02 19:59:26          0 date=1970-01-01_$folder$
```
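The same assumption would explain `date=1970-01-01`: if the partition field arrives as null, the timestamp-based key generator appears to substitute a default epoch value, and epoch zero formatted with `yyyy-MM-dd` is `1970-01-01`. A hypothetical workaround under that assumption is to keep the partition value as a regular string column inside the parquet files, so it survives a direct file read (the `sample-data-flat` path below is made up for illustration):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Read via the root so partition discovery restores the `date` column,
# then rewrite it as a plain string column stored inside the files
# instead of encoded only in the directory names.
df = spark.read.parquet('sample-data')
df.withColumn('date', F.col('date').cast('string')) \
  .write.mode('overwrite').parquet('sample-data-flat')
```

Pointing `hoodie.deltastreamer.source.dfs.root` at the rewritten dataset would then let both key generators read the `date` field from the records themselves.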
**Expected behavior**

I would expect the following output:
```
$ aws s3 ls s3://hudi-issue-standard-zone/sample-data-complex/
                           PRE .hoodie/
                           PRE date=2022-01-03/
                           PRE date=2022-01-04/
                           PRE date=2022-01-05/
                           PRE date=2022-02-05/
                           PRE date=2022-03-05/
                           PRE date=2022-05-01/
                           PRE date=2022-05-10/
```

Or:

```
$ aws s3 ls s3://hudi-issue-standard-zone/sample-data-custom/
                           PRE .hoodie/
                           PRE date=2022-01-03/
                           PRE date=2022-01-04/
                           PRE date=2022-01-05/
                           PRE date=2022-02-05/
                           PRE date=2022-03-05/
                           PRE date=2022-05-01/
                           PRE date=2022-05-10/
```
**Environment Description**

- Hudi version: 0.9.0-amzn-1
- Spark version: 3.1.2
- Hive version: 3.1.2
- Hadoop version: 3.2.1
- Storage (HDFS/S3/GCS...): S3
- Running on Docker? (yes/no): No
**Stacktrace**

There is no stack trace in this case, just an unexpected value.
**Top GitHub Comments**
@yihua nice, I’ll work on this and submit a PR, thanks. 👍🏽
Feel free to close the issue if all good.