[SUPPORT] Update of Hudi table (using Spark SQL) fails when the column contains `null` values in other records
Describe the problem you faced

Updating a Hudi table through Spark SQL fails when the updated column contains a `null` value in other records. See the reproduction steps and stack trace below.
To Reproduce
Steps to reproduce the behavior:
- Init table & data:

```sql
-- create table like this
create table hudi.update_null_test_cow (
  id bigint,
  name string,
  note string,
  dt string,
  ts bigint
) using hudi
tblproperties (
  type = 'cow',
  primaryKey = 'id',
  preCombineField = 'ts'
)
partitioned by (dt)
location 's3://my-test-bucket/hudi/update_null_test_cow';

-- insert a record
insert into hudi.update_null_test_cow partition(dt = '2022-06-08')
select 1 as id, 'john doe' as name, '' as note, 1000 as ts;
```

- Update the record; this succeeds:

```sql
update hudi.update_null_test_cow set note = 'some note' where id = 1;
```

- Insert a new record whose `note` is `null`:

```sql
insert into hudi.update_null_test_cow partition(dt = '2022-06-08')
select 2 as id, 'foobar' as name, null as note, 1000 as ts;
```

- Update the record with `id = 1` again; this fails (a verification query sketch follows this list):

```sql
update hudi.update_null_test_cow set note = 'some other note' where id = 1;
```
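For reference, the table state just before the failing update can be checked with a simple query. This is an illustrative sketch, not part of the original report; it should show `NULL` in `note` for `id = 2`.

```sql
-- Verification sketch (not from the original report): record id = 2
-- should show NULL for note right before the failing update.
select id, name, note, ts
from hudi.update_null_test_cow
where dt = '2022-06-08'
order by id;
```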
Expected behavior
The update query should complete successfully.
Environment Description
- Hudi version : 0.10.1
- Spark version : 3.2.0
- Hive version : 3.1.2
- Hadoop version : 3.2.0
- Storage (HDFS/S3/GCS…) : S3
- Running on Docker? (yes/no) : no
Additional context
The whole environment comes bundled with Amazon EMR 6.6.0.
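For completeness: the stack trace shows the statements were run through Zeppelin's Spark SQL interpreter. Anyone reproducing outside Zeppelin would need a Spark SQL session with the Hudi extension enabled; the launch command below is an illustrative assumption based on the EMR/Hudi documentation (the bundle jar path in particular may differ by EMR release) and is not taken from the original report.

```sh
# Assumed spark-sql launch on EMR (illustrative; jar path/configs may differ)
spark-sql \
  --jars /usr/lib/hudi/hudi-spark-bundle.jar \
  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  --conf "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension"
```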
Stacktrace

The error is raised by the following statement:

```sql
update hudi.update_null_test_cow set note = 'some other note' where id = 1;
```

```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 171.0 failed 4 times, most recent failure: Lost task 0.3 in stage 171.0 (TID 8477) (executor 11): org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :0
at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:322)
at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$execute$ecf5068c$1(BaseSparkCommitActionExecutor.java:174)
at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102)
at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:915)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:915)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:386)
at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1498)
at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1408)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1472)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1295)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:133)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1474)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieException: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieUpsertException: Failed to merge old record into new file for key id:2 from old file s3://my-test-bucket/hudi/update_null_test_cow/dt=2022-06-08/730df40f-7973-48eb-a494-b167030bfd37-0_0-124-6252_20220608044435168.parquet to new file s3://my-test-bucket/hudi/update_null_test_cow/dt=2022-06-08/730df40f-7973-48eb-a494-b167030bfd37-0_0-171-8477_20220608044759733.parquet with writerSchema {
"type" : "record",
"name" : "update_null_test_cow_record",
"namespace" : "hoodie.update_null_test_cow",
"fields" : [ {
"name" : "_hoodie_commit_time",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_commit_seqno",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_record_key",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_partition_path",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_file_name",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "id",
"type" : [ "null", "long" ],
"default" : null
}, {
"name" : "name",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "note",
"type" : "string"
}, {
"name" : "ts",
"type" : [ "null", "long" ],
"default" : null
}, {
"name" : "dt",
"type" : [ "null", "string" ],
"default" : null
} ]
}
at org.apache.hudi.table.action.commit.SparkMergeHelper.runMerge(SparkMergeHelper.java:102)
at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdateInternal(BaseSparkCommitActionExecutor.java:351)
at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:342)
at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:315)
... 28 more
Caused by: org.apache.hudi.exception.HoodieException: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieUpsertException: Failed to merge old record into new file for key id:2 from old file s3://my-test-bucket/hudi/update_null_test_cow/dt=2022-06-08/730df40f-7973-48eb-a494-b167030bfd37-0_0-124-6252_20220608044435168.parquet to new file s3://my-test-bucket/hudi/update_null_test_cow/dt=2022-06-08/730df40f-7973-48eb-a494-b167030bfd37-0_0-171-8477_20220608044759733.parquet with writerSchema {
"type" : "record",
"name" : "update_null_test_cow_record",
"namespace" : "hoodie.update_null_test_cow",
"fields" : [ {
"name" : "_hoodie_commit_time",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_commit_seqno",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_record_key",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_partition_path",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_file_name",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "id",
"type" : [ "null", "long" ],
"default" : null
}, {
"name" : "name",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "note",
"type" : "string"
}, {
"name" : "ts",
"type" : [ "null", "long" ],
"default" : null
}, {
"name" : "dt",
"type" : [ "null", "string" ],
"default" : null
} ]
}
at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.execute(BoundedInMemoryExecutor.java:147)
at org.apache.hudi.table.action.commit.SparkMergeHelper.runMerge(SparkMergeHelper.java:100)
... 31 more
Caused by: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieUpsertException: Failed to merge old record into new file for key id:2 from old file s3://my-test-bucket/hudi/update_null_test_cow/dt=2022-06-08/730df40f-7973-48eb-a494-b167030bfd37-0_0-124-6252_20220608044435168.parquet to new file s3://my-test-bucket/hudi/update_null_test_cow/dt=2022-06-08/730df40f-7973-48eb-a494-b167030bfd37-0_0-171-8477_20220608044759733.parquet with writerSchema {
"type" : "record",
"name" : "update_null_test_cow_record",
"namespace" : "hoodie.update_null_test_cow",
"fields" : [ {
"name" : "_hoodie_commit_time",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_commit_seqno",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_record_key",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_partition_path",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_file_name",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "id",
"type" : [ "null", "long" ],
"default" : null
}, {
"name" : "name",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "note",
"type" : "string"
}, {
"name" : "ts",
"type" : [ "null", "long" ],
"default" : null
}, {
"name" : "dt",
"type" : [ "null", "string" ],
"default" : null
} ]
}
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.execute(BoundedInMemoryExecutor.java:141)
... 32 more
Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to merge old record into new file for key id:2 from old file s3://my-test-bucket/hudi/update_null_test_cow/dt=2022-06-08/730df40f-7973-48eb-a494-b167030bfd37-0_0-124-6252_20220608044435168.parquet to new file s3://my-test-bucket/hudi/update_null_test_cow/dt=2022-06-08/730df40f-7973-48eb-a494-b167030bfd37-0_0-171-8477_20220608044759733.parquet with writerSchema {
"type" : "record",
"name" : "update_null_test_cow_record",
"namespace" : "hoodie.update_null_test_cow",
"fields" : [ {
"name" : "_hoodie_commit_time",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_commit_seqno",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_record_key",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_partition_path",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_file_name",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "id",
"type" : [ "null", "long" ],
"default" : null
}, {
"name" : "name",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "note",
"type" : "string"
}, {
"name" : "ts",
"type" : [ "null", "long" ],
"default" : null
}, {
"name" : "dt",
"type" : [ "null", "string" ],
"default" : null
} ]
}
at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:356)
at org.apache.hudi.table.action.commit.AbstractMergeHelper$UpdateHandler.consumeOneRecord(AbstractMergeHelper.java:122)
at org.apache.hudi.table.action.commit.AbstractMergeHelper$UpdateHandler.consumeOneRecord(AbstractMergeHelper.java:112)
at org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37)
at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
... 3 more
Caused by: java.lang.RuntimeException: Null-value for required field: note
at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:200)
at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:171)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:138)
at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:310)
at org.apache.hudi.io.storage.HoodieParquetWriter.writeAvro(HoodieParquetWriter.java:95)
at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:351)
... 8 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2559)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2508)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2507)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2507)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1149)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1149)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1149)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2747)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2689)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2678)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2215)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2255)
at org.apache.spark.rdd.RDD.$anonfun$take$1(RDD.scala:1449)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
at org.apache.spark.rdd.RDD.take(RDD.scala:1422)
at org.apache.spark.rdd.RDD.$anonfun$isEmpty$1(RDD.scala:1557)
at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
at org.apache.spark.rdd.RDD.isEmpty(RDD.scala:1557)
at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:657)
at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:287)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:169)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:115)
at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:112)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:108)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:519)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:83)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:519)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:495)
at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:108)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:95)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:93)
at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:136)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:848)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:382)
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:355)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:247)
at org.apache.spark.sql.hudi.command.UpdateHoodieTableCommand.run(UpdateHoodieTableCommand.scala:79)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:115)
at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:112)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:108)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:519)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:83)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:519)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:495)
at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:108)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:95)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:93)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:221)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:101)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:98)
at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:618)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:613)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:651)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.zeppelin.spark.SparkSqlInterpreter.internalInterpret(SparkSqlInterpreter.java:106)
at org.apache.zeppelin.interpreter.AbstractInterpreter.interpret(AbstractInterpreter.java:55)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:110)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:849)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:741)
at org.apache.zeppelin.scheduler.Job.run(Job.java:172)
at org.apache.zeppelin.scheduler.AbstractScheduler.runJob(AbstractScheduler.java:132)
at org.apache.zeppelin.scheduler.FIFOScheduler.lambda$runJobInScheduler$0(FIFOScheduler.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :0
at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:322)
at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$execute$ecf5068c$1(BaseSparkCommitActionExecutor.java:174)
at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102)
at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:915)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:915)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:386)
at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1498)
at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1408)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1472)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1295)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:133)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1474)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
... 3 more
Caused by: org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieException: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieUpsertException: Failed to merge old record into new file for key id:2 from old file s3://my-test-bucket/hudi/update_null_test_cow/dt=2022-06-08/730df40f-7973-48eb-a494-b167030bfd37-0_0-124-6252_20220608044435168.parquet to new file s3://my-test-bucket/hudi/update_null_test_cow/dt=2022-06-08/730df40f-7973-48eb-a494-b167030bfd37-0_0-171-8477_20220608044759733.parquet with writerSchema {
"type" : "record",
"name" : "update_null_test_cow_record",
"namespace" : "hoodie.update_null_test_cow",
"fields" : [ {
"name" : "_hoodie_commit_time",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_commit_seqno",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_record_key",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_partition_path",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_file_name",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "id",
"type" : [ "null", "long" ],
"default" : null
}, {
"name" : "name",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "note",
"type" : "string"
}, {
"name" : "ts",
"type" : [ "null", "long" ],
"default" : null
}, {
"name" : "dt",
"type" : [ "null", "string" ],
"default" : null
} ]
}
at org.apache.hudi.table.action.commit.SparkMergeHelper.runMerge(SparkMergeHelper.java:102)
at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdateInternal(BaseSparkCommitActionExecutor.java:351)
at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:342)
at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:315)
... 28 more
Caused by: org.apache.hudi.exception.HoodieException: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieUpsertException: Failed to merge old record into new file for key id:2 from old file s3://my-test-bucket/hudi/update_null_test_cow/dt=2022-06-08/730df40f-7973-48eb-a494-b167030bfd37-0_0-124-6252_20220608044435168.parquet to new file s3://my-test-bucket/hudi/update_null_test_cow/dt=2022-06-08/730df40f-7973-48eb-a494-b167030bfd37-0_0-171-8477_20220608044759733.parquet with writerSchema {
"type" : "record",
"name" : "update_null_test_cow_record",
"namespace" : "hoodie.update_null_test_cow",
"fields" : [ {
"name" : "_hoodie_commit_time",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_commit_seqno",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_record_key",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_partition_path",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_file_name",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "id",
"type" : [ "null", "long" ],
"default" : null
}, {
"name" : "name",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "note",
"type" : "string"
}, {
"name" : "ts",
"type" : [ "null", "long" ],
"default" : null
}, {
"name" : "dt",
"type" : [ "null", "string" ],
"default" : null
} ]
}
at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.execute(BoundedInMemoryExecutor.java:147)
at org.apache.hudi.table.action.commit.SparkMergeHelper.runMerge(SparkMergeHelper.java:100)
... 31 more
Caused by: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieUpsertException: Failed to merge old record into new file for key id:2 from old file s3://my-test-bucket/hudi/update_null_test_cow/dt=2022-06-08/730df40f-7973-48eb-a494-b167030bfd37-0_0-124-6252_20220608044435168.parquet to new file s3://my-test-bucket/hudi/update_null_test_cow/dt=2022-06-08/730df40f-7973-48eb-a494-b167030bfd37-0_0-171-8477_20220608044759733.parquet with writerSchema {
"type" : "record",
"name" : "update_null_test_cow_record",
"namespace" : "hoodie.update_null_test_cow",
"fields" : [ {
"name" : "_hoodie_commit_time",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_commit_seqno",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_record_key",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_partition_path",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_file_name",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "id",
"type" : [ "null", "long" ],
"default" : null
}, {
"name" : "name",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "note",
"type" : "string"
}, {
"name" : "ts",
"type" : [ "null", "long" ],
"default" : null
}, {
"name" : "dt",
"type" : [ "null", "string" ],
"default" : null
} ]
}
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.execute(BoundedInMemoryExecutor.java:141)
... 32 more
Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to merge old record into new file for key id:2 from old file s3://my-test-bucket/hudi/update_null_test_cow/dt=2022-06-08/730df40f-7973-48eb-a494-b167030bfd37-0_0-124-6252_20220608044435168.parquet to new file s3://my-test-bucket/hudi/update_null_test_cow/dt=2022-06-08/730df40f-7973-48eb-a494-b167030bfd37-0_0-171-8477_20220608044759733.parquet with writerSchema {
"type" : "record",
"name" : "update_null_test_cow_record",
"namespace" : "hoodie.update_null_test_cow",
"fields" : [ {
"name" : "_hoodie_commit_time",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_commit_seqno",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_record_key",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_partition_path",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_file_name",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "id",
"type" : [ "null", "long" ],
"default" : null
}, {
"name" : "name",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "note",
"type" : "string"
}, {
"name" : "ts",
"type" : [ "null", "long" ],
"default" : null
}, {
"name" : "dt",
"type" : [ "null", "string" ],
"default" : null
} ]
}
at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:356)
at org.apache.hudi.table.action.commit.AbstractMergeHelper$UpdateHandler.consumeOneRecord(AbstractMergeHelper.java:122)
at org.apache.hudi.table.action.commit.AbstractMergeHelper$UpdateHandler.consumeOneRecord(AbstractMergeHelper.java:112)
at org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37)
at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
... 3 more
Caused by: java.lang.RuntimeException: Null-value for required field: note
at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:200)
at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:171)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:138)
at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:310)
at org.apache.hudi.io.storage.HoodieParquetWriter.writeAvro(HoodieParquetWriter.java:95)
at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:351)
... 8 more
```
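Note the writer schema embedded in the stack trace: every other field is a nullable union (`[ "null", "string" ]` with `"default" : null`), but `note` is declared as a plain `"string"`, so Parquet treats it as a required field. When the update of `id:1` rewrites the file group, the untouched record `id:2` (whose `note` is NULL) is copied into the new file and fails the required-field check, which is exactly the `Null-value for required field: note` error above.

The proper fix belongs in how the Spark SQL update path builds the writer schema (keeping the column nullable). As a purely untested sketch, one way to avoid tripping over the already-written non-nullable schema is to make sure no record carries a NULL in that column before updating, for example by coalescing NULLs at insert time or backfilling them first:

```sql
-- Untested workaround sketch (not from the original report):
-- backfill existing NULLs in the affected column, then retry the update.
update hudi.update_null_test_cow set note = '' where note is null;
update hudi.update_null_test_cow set note = 'some other note' where id = 1;
```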
Top GitHub Comments
@XuQianJin-Stars @minihippo : can you assist here please.
@XuQianJin-Stars @minihippo : Can you assist here on setting null defaults from spark-sql.