[SUPPORT] Deltastreamer updates not supporting the addition of new columns

Describe the problem you faced

Currently using the Deltastreamer to ingest from one S3 bucket to another. In Hudi 0.10 I would use the upsert operation in the Deltastreamer, and when a new column was added to the schema, the target table would reflect that.

However, in Hudi 0.11.1 using the insert operation, schema changes are not reflected in the target table; specifically, newly added nullable columns do not appear. Other important notes: I also enabled the metadata table and the column stats index.

To Reproduce

Steps to reproduce the behavior:

  1. Start the Deltastreamer using the script below
  2. Add a new nullable column to the incoming data (see the sketch after the script)
  3. Query the target table for the new column
spark-submit \
--jars /opt/spark/jars/hudi-utilities-bundle.jar,/opt/spark/jars/hadoop-aws.jar,/opt/spark/jars/aws-java-sdk.jar \
--master spark://spark-master:7077 \
--total-executor-cores 20 \
--executor-memory 4g \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer /opt/spark/jars/hudi-utilities-bundle.jar \
--source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
--target-table per_tick_stats \
--table-type COPY_ON_WRITE \
--min-sync-interval-seconds 30 \
--source-limit 250000000 \
--continuous \
--source-ordering-field $3 \
--target-base-path $2 \
--hoodie-conf hoodie.deltastreamer.source.dfs.root=$1 \
--hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator \
--hoodie-conf hoodie.datasource.write.recordkey.field=$4 \
--hoodie-conf hoodie.datasource.write.precombine.field=$3 \
--hoodie-conf hoodie.clustering.plan.strategy.sort.columns=$5 \
--hoodie-conf hoodie.datasource.write.partitionpath.field=$6 \
--hoodie-conf hoodie.clustering.inline=true \
--hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=100000000 \
--hoodie-conf hoodie.clustering.inline.max.commits=4 \
--hoodie-conf hoodie.metadata.enable=true \
--hoodie-conf hoodie.metadata.index.column.stats.enable=true \
--op INSERT
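# Positional arguments used above: $1 = source DFS root, $2 = target base path,
# $3 = ordering/precombine field, $4 = record key field, $5 = clustering sort
# columns, $6 = partition path field. Example invocation: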
./deltastreamer.sh s3a://simian-example-prod-output/stats/ingesting s3a://simian-example-prod-output/stats/querying STATOVYGIYLUMVSF6YLU STATONUW25LMMF2GS33OL5ZHK3S7NFSA____,STATONUW2X3UNFWWK___ STATONUW25LMMF2GS33OL5ZHK3S7NFSA____,STATMJQXIY3IL5ZHK3S7NFSA____
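
A minimal sketch of step 2, assuming a hypothetical staging location (upstream-bucket) and a hypothetical column name (new_nullable_col); neither appears in the original issue. It appends a parquet batch carrying one extra nullable column into the source prefix the Deltastreamer polls ($1):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

session = SparkSession.builder.appName("add-nullable-column").getOrCreate()

# Hypothetical staging location holding the next upstream batch.
upstream_df = session.read.parquet("s3a://upstream-bucket/stats/batch")

# lit(None).cast(...) produces a column that is nullable by construction.
next_batch = upstream_df.withColumn("new_nullable_col", F.lit(None).cast("string"))

# Land the widened batch in the prefix the Deltastreamer is polling ($1).
next_batch.write.mode("append").parquet("s3a://simian-example-prod-output/stats/ingesting")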

Expected behavior

The new nullable column should be present in the target table.

Environment Description

  • Hudi version : 0.11.1

  • Spark version : 3.1.2

  • Hive version : 3.2.0

  • Storage (HDFS/S3/GCS…) : S3

  • Running on Docker? (yes/no) : yes

Additional context

Initially used upsert but was unable to continue using it because of this issue: https://github.com/apache/hudi/issues/6015

Stacktrace

No stacktrace was provided.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 9 (3 by maintainers)

Top GitHub Comments

1 reaction
rohit-m-99 commented, Aug 18, 2022

Option 2 worked for me! Set hoodie.metadata.enable to false in the Deltastreamer and wait for a few commits so that the metadata table is deleted completely (no .hoodie/metadata folder), and then re-enable the metadata table.
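
A minimal sketch of the "wait until .hoodie/metadata is gone" check, done through PySpark's JVM gateway to the Hadoop FileSystem API; the target base path is taken from this issue's script ($2), and the check is assumed to run from a Spark environment with the same S3 configuration as the job:

from pyspark.sql import SparkSession

session = SparkSession.builder.appName("check-metadata-gone").getOrCreate()

# Resolve the metadata table path under the target table's .hoodie folder.
jvm = session.sparkContext._jvm
hadoop_conf = session.sparkContext._jsc.hadoopConfiguration()
metadata_path = jvm.org.apache.hadoop.fs.Path(
    "s3a://simian-example-prod-output/stats/querying/.hoodie/metadata")
fs = metadata_path.getFileSystem(hadoop_conf)

# Only set hoodie.metadata.enable back to true once this prints False.
print(fs.exists(metadata_path))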

1 reaction
rohit-m-99 commented, Aug 9, 2022

This issue was resolved by removing the asterisk (wildcard) from the load path instead:

stat_data_frame = (
    session.read.format("hudi")
    .option("hoodie.datasource.write.reconcile.schema", "true")
    .load("s3a://example-prod-output/stats/querying")
)
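
To confirm the evolved schema is visible after the read above, a quick check (new_nullable_col is the hypothetical column name from the sketch in the reproduce section, not a name from the original issue):

stat_data_frame.printSchema()
stat_data_frame.select("new_nullable_col").show(5)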
