
[SUPPORT] HELP :: Using TWO FIELDS to precombine :: 'hoodie.datasource.write.precombine.field': "column1,column2"

See original GitHub issue

ERROR WHILE LOADING INCREMENTAL DATA

An error occurred while calling o605.save. Failed to upsert for commit time 20220813064526092

format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o605.save.
: org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for commit time 20220812223251416
   at org.apache.hudi.table.action.commit.AbstractWriteHelper.write(AbstractWriteHelper.java:63)
   at org.apache.hudi.table.action.commit.SparkUpsertCommitActionExecutor.execute(SparkUpsertCommitActionExecutor.java:46)
   at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:119)
   at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:103)
   at org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:160)
   at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:217)
   at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:277)
   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:164)
   at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
   at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)
   at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)
   at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)
   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
   at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220)
   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181)
   at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:134)
   at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:133)
   at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
   at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
   at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
   at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
   at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
   at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
   at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
   at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
   at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
   at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
   at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)
   at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)
   at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)
   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:301)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:498)
   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
   at py4j.Gateway.invoke(Gateway.java:282)
   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   at py4j.commands.CallCommand.execute(CallCommand.java:79)
   at py4j.GatewayConnection.run(GatewayConnection.java:238)
   at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 47 in stage 8.0 failed 4 times, most recent failure: Lost task 47.3 in stage 8.0 (TID 1833) (162.44.118.51 executor 7):
org.apache.hudi.exception.HoodieException: score,capture_date(Part -score,capture_date) field not found in record. Acceptable fields were :
[column1,column2, .................., score, capture_date]

The score (double) and capture_date (timestamp, not null) columns are present in the column list, but the error still occurs.

CONFIG USED ::

commonConfig = {
    'className': 'org.apache.hudi',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.write.precombine.field': 'score,capture_date',  #### USING TWO COLUMNS ######
    'hoodie.datasource.write.recordkey.field': 'uuid',
    'hoodie.table.name': 'sales',
    'hoodie.consistency.check.enabled': 'true',
    'hoodie.datasource.hive_sync.database': 'sales',
    'hoodie.datasource.hive_sync.table': 'sales',
    'hoodie.datasource.hive_sync.enable': 'true',
    'path': 's3://datawarehouse/DATA/DEV/gold/sales/',
    'hoodie.index.type': 'GLOBAL_SIMPLE',
    'hoodie.simple.index.update.partition.path': 'true',
    'hoodie.global.simple.index.parallelism': '20'
}

partitionDataConfig = {
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
    'hoodie.datasource.write.partitionpath.field': 'country,zipcode',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.hive_sync.partition_fields': 'country,zipcode',
    'hoodie.datasource.write.hive_style_partitioning': 'true'
}

# USED FOR FIRST TIME BULK INSERT
initLoadConfig = {
    'hoodie.bulkinsert.shuffle.parallelism': 20,
    'hoodie.datasource.write.operation': 'bulk_insert'
}

incrementalWriteConfig = {
    'hoodie.upsert.shuffle.parallelism': 20,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS',
    'hoodie.cleaner.commits.retained': 5
}

upsertConf = {**commonConfig, **partitionDataConfig, **incrementalWriteConfig}

df.write.format('org.apache.hudi').options(**upsertConf).mode('append').save()
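
Note that Hudi reads the value of hoodie.datasource.write.precombine.field as the name of a single column, which is why the exception above reports that a field literally named "score,capture_date" was not found. For reference, a minimal sketch of the supported single-field form, assuming capture_date alone were an acceptable ordering field (which may not match the intended business logic):

# Supported form: the precombine option names exactly one existing column.
commonConfig['hoodie.datasource.write.precombine.field'] = 'capture_date'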

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction
shubham-bungee commented on Aug 17, 2022

Unfortunately, there is no out-of-the-box solution for using two fields as preCombine for now.

Thanks a lot for the reply. We are a startup planning to move to Hudi, so you might see a few more support tickets coming your way. Your help would be great in building our new architecture.

0 reactions
xushiyan commented on Oct 31, 2022

Some previous efforts on this feature; still a work in progress:

https://github.com/apache/hudi/pull/2519
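
Until that work lands, a possible workaround (not an out-of-the-box Hudi feature) is to collapse the two ordering columns into a single derived column and point hoodie.datasource.write.precombine.field at it. The sketch below is only illustrative: the column name precombine_key and the string encoding are assumptions, and you would need to verify that lexicographic ordering of the derived value matches the intended (capture_date, score) ordering (negative scores, for example, would need a different encoding).

from pyspark.sql import functions as F

# Hypothetical workaround: build one sortable string out of capture_date and
# score and use it as the single precombine field. "precombine_key" and the
# formats below are assumptions, not part of the original setup.
df = df.withColumn(
    "precombine_key",
    F.concat_ws(
        "#",
        F.date_format(F.col("capture_date"), "yyyyMMddHHmmssSSS"),  # timestamp first
        F.format_string("%015.4f", F.col("score")),                 # zero-padded score
    ),
)

upsertConf = {**upsertConf, 'hoodie.datasource.write.precombine.field': 'precombine_key'}

df.write.format('org.apache.hudi').options(**upsertConf).mode('append').save()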


Top Results From Across the Web

All Configurations | Apache Hudi
This page covers the different ways of configuring your job to write/read Hudi tables. At a high level, you can control behaviour at...

[SUPPORT] hoodie.datasource.write.precombine.field not ...
Hi team, In one of our tables, we have Version as a Pre combine field and in the write option we have used...

Why apache-hudi is creating COPY_ON_WRITE table even if I ...
The issue is with "hoodie.table.type": "MERGE_ON_READ", configuration. You have to use hoodie.datasource.write.table.type instead.

Apache Hudi: Basic CRUD operations - Medium
datasource.write.precombine.field`) : Field used to de-dup multiple records among the batch of records thats being ingested. `hoodie.insert.

Key Learnings on Using Apache HUDI in building Lakehouse ...
In this blog, we will go through Apache HUDI in detail and how it helped us in ... hoodie.datasource.write.precombine.field: precombine
