
[SUPPORT] HELP :: Using TWO FIELDS to precombine :: 'hoodie.datasource.write.precombine.field': "column1,column2"

See original GitHub issue

ERROR WHILE LOADING INCREMENTAL DATA

An error occurred while calling o605.save. Failed to upsert for commit time 20220813064526092

format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o605.save.
: org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for commit time 20220812223251416
   at org.apache.hudi.table.action.commit.AbstractWriteHelper.write(AbstractWriteHelper.java:63)
   at org.apache.hudi.table.action.commit.SparkUpsertCommitActionExecutor.execute(SparkUpsertCommitActionExecutor.java:46)
   at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:119)
   at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:103)
   at org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:160)
   at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:217)
   at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:277)
   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:164)
   at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
   at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)
   at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)
   at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)
   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
   at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220)
   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181)
   at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:134)
   at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:133)
   at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
   at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
   at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
   at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
   at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
   at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
   at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
   at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
   at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
   at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
   at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)
   at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)
   at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)
   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:301)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:498)
   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
   at py4j.Gateway.invoke(Gateway.java:282)
   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   at py4j.commands.CallCommand.execute(CallCommand.java:79)
   at py4j.GatewayConnection.run(GatewayConnection.java:238)
   at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 47 in stage 8.0 failed 4 times, most recent failure: Lost task 47.3 in stage 8.0 (TID 1833) (162.44.118.51 executor 7):
org.apache.hudi.exception.HoodieException: score,capture_date(Part -score,capture_date) field not found in record. Acceptable fields were :
[column1,column2, .................., score, capture_date]

The score (double) and capture_date (timestamp, not null) columns are present in the column list, but the error still occurs.

CONFIG USED ::

commonConfig = {
    'className': 'org.apache.hudi',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.write.precombine.field': 'score,capture_date',  #### USING TWO COLUMNS ######
    'hoodie.datasource.write.recordkey.field': 'uuid',
    'hoodie.table.name': 'sales',
    'hoodie.consistency.check.enabled': 'true',
    'hoodie.datasource.hive_sync.database': 'sales',
    'hoodie.datasource.hive_sync.table': 'sales',
    'hoodie.datasource.hive_sync.enable': 'true',
    'path': 's3://datawarehouse/DATA/DEV/gold/sales/',
    'hoodie.index.type': 'GLOBAL_SIMPLE',
    'hoodie.simple.index.update.partition.path': 'true',
    'hoodie.global.simple.index.parallelism': '20'
}

partitionDataConfig = {
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
    'hoodie.datasource.write.partitionpath.field': 'country,zipcode',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.hive_sync.partition_fields': 'country,zipcode',
    'hoodie.datasource.write.hive_style_partitioning': 'true'
}

# USED FOR FIRST TIME BULK INSERT
initLoadConfig = {
    'hoodie.bulkinsert.shuffle.parallelism': 20,
    'hoodie.datasource.write.operation': 'bulk_insert'
}

incrementalWriteConfig = {
    'hoodie.upsert.shuffle.parallelism': 20,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS',
    'hoodie.cleaner.commits.retained': 5
}

upsertConf = {**commonConfig, **partitionDataConfig, **incrementalWriteConfig}

df.write.format('org.apache.hudi').options(**upsertConf).mode('append').save()
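
Note that Hudi reads the value of hoodie.datasource.write.precombine.field as the name of a single column, which is why the exception above reports that a field literally named "score,capture_date" was not found. For reference, a minimal sketch of the supported single-field form, assuming capture_date alone were an acceptable ordering field (which may not match the intended business logic):

# Supported form: the precombine option names exactly one existing column.
commonConfig['hoodie.datasource.write.precombine.field'] = 'capture_date'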

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction
shubham-bungee commented on Aug 17, 2022

Unfortunately, there is no out-of-the-box solution for using two fields as preCombine for now.

Thanks a lot for the reply. We are a startup planning to move to Hudi, so you might see a few more support tickets coming your way. Your help would be great in building our new architecture.

0 reactions
xushiyan commented on Oct 31, 2022

Some previous efforts on this feature; still a work in progress:

https://github.com/apache/hudi/pull/2519
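
Until that work lands, a possible workaround (not an out-of-the-box Hudi feature) is to collapse the two ordering columns into a single derived column and point hoodie.datasource.write.precombine.field at it. The sketch below is only illustrative: the column name precombine_key and the string encoding are assumptions, and you would need to verify that lexicographic ordering of the derived value matches the intended (capture_date, score) ordering (negative scores, for example, would need a different encoding).

from pyspark.sql import functions as F

# Hypothetical workaround: build one sortable string out of capture_date and
# score and use it as the single precombine field. "precombine_key" and the
# formats below are assumptions, not part of the original setup.
df = df.withColumn(
    "precombine_key",
    F.concat_ws(
        "#",
        F.date_format(F.col("capture_date"), "yyyyMMddHHmmssSSS"),  # timestamp first
        F.format_string("%015.4f", F.col("score")),                 # zero-padded score
    ),
)

upsertConf = {**upsertConf, 'hoodie.datasource.write.precombine.field': 'precombine_key'}

df.write.format('org.apache.hudi').options(**upsertConf).mode('append').save()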


Top Results From Across the Web

All Configurations | Apache Hudi
This page covers the different ways of configuring your job to write/read Hudi tables. At a high level, you can control behaviour at...

[SUPPORT] hoodie.datasource.write.precombine.field not ...
Hi team, In one of our tables, we have Version as a Pre combine field and in the write option we have used...

Why apache-hudi is creating COPY_ON_WRITE table even if I ...
The issue is with "hoodie.table.type": "MERGE_ON_READ", configuration. You have to use hoodie.datasource.write.table.type instead.

Apache Hudi: Basic CRUD operations - Medium
datasource.write.precombine.field`) : Field used to de-dup multiple records among the batch of records thats being ingested. `hoodie.insert.

Key Learnings on Using Apache HUDI in building Lakehouse ...
In this blog, we will go through Apache HUDI in detail and how it helped us in ... hoodie.datasource.write.precombine.field: precombine
