[SUPPORT] Deltastreamer updates not supporting the addition of new columns
Describe the problem you faced
Currently using the Deltastreamer to ingest from one S3 bucket to another. In Hudi 0.10 I used the upsert operation in the Deltastreamer; when a new column was added to the schema, the target table would reflect it.
However, in Hudi 0.11.1 using the insert operation, schema changes are not reflected in the target table, specifically the addition of nullable columns. One other important note: I enabled the metadata table and the column stats index.
To Reproduce
Steps to reproduce the behavior:
- Start the Deltastreamer using the script below
- Add a new nullable column to the source data (a minimal sketch of this step follows the script)
- Query the target table for the new column
spark-submit \
--jars /opt/spark/jars/hudi-utilities-bundle.jar,/opt/spark/jars/hadoop-aws.jar,/opt/spark/jars/aws-java-sdk.jar \
--master spark://spark-master:7077 \
--total-executor-cores 20 \
--executor-memory 4g \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer /opt/spark/jars/hudi-utilities-bundle.jar \
--source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
--target-table per_tick_stats \
--table-type COPY_ON_WRITE \
--min-sync-interval-seconds 30 \
--source-limit 250000000 \
--continuous \
--source-ordering-field $3 \
--target-base-path $2 \
--hoodie-conf hoodie.deltastreamer.source.dfs.root=$1 \
--hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator \
--hoodie-conf hoodie.datasource.write.recordkey.field=$4 \
--hoodie-conf hoodie.datasource.write.precombine.field=$3 \
--hoodie-conf hoodie.clustering.plan.strategy.sort.columns=$5 \
--hoodie-conf hoodie.datasource.write.partitionpath.field=$6 \
--hoodie-conf hoodie.clustering.inline=true \
--hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=100000000 \
--hoodie-conf hoodie.clustering.inline.max.commits=4 \
--hoodie-conf hoodie.metadata.enable=true \
--hoodie-conf hoodie.metadata.index.column.stats.enable=true \
--op INSERT
./deltastreamer.sh s3a://simian-example-prod-output/stats/ingesting s3a://simian-example-prod-output/stats/querying STATOVYGIYLUMVSF6YLU STATONUW25LMMF2GS33OL5ZHK3S7NFSA____,STATONUW2X3UNFWWK___ STATONUW25LMMF2GS33OL5ZHK3S7NFSA____,STATMJQXIY3IL5ZHK3S7NFSA____
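For step 2, a minimal sketch of what adding a new nullable column to the source data could look like, assuming a PySpark session; the column name new_nullable_col and the staging path are hypothetical, while the destination folder is the Deltastreamer source root ($1 in the script above):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("add-nullable-column").getOrCreate()

# Read a batch of source data from a staging location (hypothetical path).
df = spark.read.parquet("s3a://example-staging/stats/batch")

# Add a new nullable column; lit(None) is cast to a concrete type so the
# Parquet schema records it as an optional string field.
df_evolved = df.withColumn("new_nullable_col", F.lit(None).cast("string"))

# Land the evolved batch in the folder the Deltastreamer polls
# (hoodie.deltastreamer.source.dfs.root, i.e. $1 in the script above).
df_evolved.write.mode("append").parquet("s3a://simian-example-prod-output/stats/ingesting")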
Expected behavior
The new nullable column should be present in the target table.
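A quick way to verify, assuming a PySpark session and the hypothetical column name from the sketch above ($2 in the script is the target base path):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("check-new-column").getOrCreate()

# Load the target table and confirm the new column is present and nullable.
result = (
    spark.read.format("hudi")
    .load("s3a://simian-example-prod-output/stats/querying")
    .select("new_nullable_col")
)
result.printSchema()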
Environment Description
- Hudi version : 0.11.1
- Spark version : 3.1.2
- Hive version : 3.2.0
- Storage (HDFS/S3/GCS…) : S3
- Running on Docker? (yes/no) : yes
Additional context
Initially used upsert but was unable to continue using it because of this issue: https://github.com/apache/hudi/issues/6015
Top GitHub Comments
Option 2 worked for me! Set hoodie.metadata.enable to false in the Deltastreamer and wait a few commits so that the metadata table is deleted completely (no .hoodie/metadata folder), then re-enable the metadata table.
This issue was resolved by removing the asterisk from the load path instead:
stat_data_frame = (session.read.format("hudi").option("hoodie.datasource.write.reconcile.schema", "true").load("s3a://example-prod-output/stats/querying"))
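For context, a sketch of the likely before/after, assuming the failing reader used a trailing glob (the original globbed path is not quoted in the thread). Loading the table base path lets Hudi resolve the schema from the latest commit metadata instead of inferring it from whichever data files the glob happens to match:

# Before (assumed): a glob reads data files directly, so the inferred
# schema can miss columns added after the older files were written.
# stat_data_frame = session.read.format("hudi").load("s3a://example-prod-output/stats/querying/*")

# After: load the table base path and let Hudi reconcile the schema.
stat_data_frame = (
    session.read.format("hudi")
    .option("hoodie.datasource.write.reconcile.schema", "true")
    .load("s3a://example-prod-output/stats/querying")
)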