[SUPPORT] org.apache.hudi.exception.HoodieException: The value of <field> can not be null
Describe the problem you faced
I’m running Hudi 0.9.0, creating an external Hudi table on S3, and when trying to insert into this table using Spark SQL, it fails with the exception org.apache.hudi.exception.HoodieException: The value of <field> can not be null.
- The <field> is always the last field of my table. If I try to change the field, then the error also appears for the new last field.
- The <field> is not part of my primary_key or combine_key (or anything else). It’s just a “normal” field.
- This error does not happen when there is no primaryKey in the OPTIONS of the create table statement.
- This error does not happen when there is a primaryKey and a preCombineField in the OPTIONS of the create table statement.
Wondering if this means that when we have a primaryKey, we must also have a preCombineField? In that case, the example in https://hudi.apache.org/docs/quick-start-guide might be wrong, because it creates a table with primaryKey but without preCombineField.
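For reference, a minimal sketch of the non-failing variant (all names are placeholders, mirroring the failing statement in the repro steps below); the only difference is the extra preCombineField option:
CREATE TABLE IF NOT EXISTS <schema>.<table_name> (
  <field1> <datatype1>,
  ....
  <fieldN> <datatypeN>
) USING hudi
LOCATION 's3a://<bucket>/<object>'
OPTIONS (
  type = 'cow',
  primaryKey = '<table_primary_key>',
  -- <non_null_field> is a placeholder for any column that is never null
  preCombineField = '<non_null_field>'
);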
Thanks!
To Reproduce
Steps to reproduce the behavior:
- Run a create table statement
CREATE TABLE IF NOT EXISTS <schema>.<table_name> (
<field1> <datatype1>,
<field2> <datatype2>,
<field3> <datatype3>,
....
<fieldN> <datatypeN>
) USING hudi
LOCATION 's3a://<bucket>/<object>'
OPTIONS (
type = 'cow',
primaryKey = '<table_primary_key>'
);
- Run insert into statement
insert into <schema>.<table_name>
select
<field1>,
<field2>,
<field3>,
....
<fieldN>
from <schema>.<table_name>
- Error is thrown:
org.apache.hudi.exception.HoodieException: The value of <field> can not be null
Expected behavior
I would expect data to be correctly inserted instead of throwing an error.
Environment Description
- Hudi version : 0.9.0
- Spark version : 2.4.4
- Hive version : 2.3.5
- Hadoop version :
- Storage (HDFS/S3/GCS…) : S3
- Running on Docker? (yes/no) : No
Stacktrace
27719 [task-result-getter-3] ERROR org.apache.spark.scheduler.TaskSetManager - Task 0 in stage 3.0 failed 4 times; aborting job
Exception in thread "main" org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for commit time 20211126023726
at org.apache.hudi.table.action.commit.AbstractWriteHelper.write(AbstractWriteHelper.java:62)
at org.apache.hudi.table.action.commit.SparkUpsertCommitActionExecutor.execute(SparkUpsertCommitActionExecutor.java:46)
at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:98)
at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:88)
at org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:157)
at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:214)
at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:265)
at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.run(InsertIntoHoodieTableCommand.scala:103)
at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand.run(InsertIntoHoodieTableCommand.scala:59)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:194)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:79)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:694)
at com.twilio.optimustranformer.OptimusTranformer$$anonfun$main$1$$anonfun$apply$mcV$sp$1.apply(OptimusTranformer.scala:76)
at com.twilio.optimustranformer.OptimusTranformer$$anonfun$main$1$$anonfun$apply$mcV$sp$1.apply(OptimusTranformer.scala:74)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at com.twilio.optimustranformer.OptimusTranformer$$anonfun$main$1.apply$mcV$sp(OptimusTranformer.scala:73)
at scala.util.control.Breaks.breakable(Breaks.scala:38)
at com.twilio.optimustranformer.OptimusTranformer$.main(OptimusTranformer.scala:72)
at com.twilio.optimustranformer.OptimusTranformer.main(OptimusTranformer.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 96, 10.212.241.12, executor 0): org.apache.hudi.exception.HoodieException: The value of <field> can not be null
at org.apache.hudi.avro.HoodieAvroUtils.getNestedFieldVal(HoodieAvroUtils.java:484)
at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$7.apply(HoodieSparkSqlWriter.scala:233)
at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$7.apply(HoodieSparkSqlWriter.scala:230)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:193)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$countByKey$1.apply(PairRDDFunctions.scala:370)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$countByKey$1.apply(PairRDDFunctions.scala:370)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.countByKey(PairRDDFunctions.scala:369)
at org.apache.spark.api.java.JavaPairRDD.countByKey(JavaPairRDD.scala:312)
at org.apache.hudi.index.bloom.SparkHoodieBloomIndex.lookupIndex(SparkHoodieBloomIndex.java:114)
at org.apache.hudi.index.bloom.SparkHoodieBloomIndex.tagLocation(SparkHoodieBloomIndex.java:84)
at org.apache.hudi.index.bloom.SparkHoodieBloomIndex.tagLocation(SparkHoodieBloomIndex.java:60)
at org.apache.hudi.table.action.commit.AbstractWriteHelper.tag(AbstractWriteHelper.java:69)
at org.apache.hudi.table.action.commit.AbstractWriteHelper.write(AbstractWriteHelper.java:51)
... 42 more
Caused by: org.apache.hudi.exception.HoodieException: The value of <field> can not be null
at org.apache.hudi.avro.HoodieAvroUtils.getNestedFieldVal(HoodieAvroUtils.java:484)
at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$7.apply(HoodieSparkSqlWriter.scala:233)
at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$7.apply(HoodieSparkSqlWriter.scala:230)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:193)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Top GitHub Comments
@nsivabalan @BenjMaq In Hudi version 0.9.0, SQL insert takes the last field of the schema as the PRECOMBINE_FIELD by default. If that field is null, this exception may be thrown.
Thanks!
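Based on that diagnosis, two possible workarounds on 0.9.0 would be to declare preCombineField explicitly in OPTIONS (pointing at a column that is never null, as sketched in the issue description above), or to guarantee that the last selected column is never null. A minimal sketch of the latter (all names are placeholders):
INSERT INTO <schema>.<table_name>
SELECT
  <field1>,
  ....
  -- <default_value> is a placeholder: substitute a sensible non-null default
  coalesce(<fieldN>, <default_value>) AS <fieldN>
FROM <source_table>;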