[SUPPORT] HoodieKeyException: recordKey value: "null"
See original GitHub issue.
Tips before filing an issue:
- Have you gone through our FAQs? Yes. The correct address is https://hudi.apache.org/learn/faq/
Describe the problem you faced
A write operation using bulk_insert fails when writing to a non-empty Hudi table. It does not fail if the table is empty.
To Reproduce
Steps to reproduce the behavior:
- Create a new table and write some data with the bulk_insert option.
- Write the same data batch to the table again with the bulk_insert option.
Hudi settings:
val unpartitionDataConfig = Map(
  HIVE_PARTITION_EXTRACTOR_CLASS.key -> classOf[NonPartitionedExtractor].getName,
  KEYGENERATOR_CLASS_NAME.key -> classOf[NonpartitionedKeyGenerator].getName
)
private def options(
    table: String,
    primaryKey: String,
    database: String,
    operation: String
): Map[String, String] =
  Map(
    OPERATION.key -> operation,
    PRECOMBINE_FIELD.key -> EventTimestampColumn,
    RECORDKEY_FIELD.key -> primaryKey,
    TABLE_TYPE.key -> COW_TABLE_TYPE_OPT_VAL,
    TBL_NAME.key -> table,
    "hoodie.consistency.check.enabled" -> "true",
    HIVE_SYNC_MODE.key -> "jdbc",
    HIVE_SYNC_ENABLED.key -> "true",
    HIVE_SUPPORT_TIMESTAMP_TYPE.key -> "true",
    HIVE_DATABASE.key -> database,
    HIVE_TABLE.key -> table,
    UPSERT_PARALLELISM_VALUE.key -> "4",
    DELETE_PARALLELISM_VALUE.key -> "4",
    BULKINSERT_PARALLELISM_VALUE.key -> "4"
  ) ++ unpartitionDataConfig
def writerOptions(
    table: String,
    primaryKey: String,
    database: String
): Map[String, String] = {
  val operation = BULK_INSERT_OPERATION_OPT_VAL
  options(table, primaryKey, database, operation) ++ unpartitionDataConfig
}
Spark main code:
val options = writerOptions(tableName, primaryKey, database)

session.read.format("parquet")
  .load(inputPath)
  .write
  .format("hudi")
  .options(options)
  .mode(SaveMode.Overwrite)
  .save(targetPath)
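One way to narrow this down is to validate the batch for null or empty record keys before handing it to Hudi, since that is exactly what the key generator rejects. A minimal sketch — the predicate below is ours, not a Hudi API, and the Spark filter in the comment assumes the same `primaryKey` column name as above:

```scala
// Predicate mirroring what makes Hudi's key generator throw: a record key
// value that is null, empty, or whitespace-only cannot be used.
def isInvalidKey(v: String): Boolean = v == null || v.trim.isEmpty

// Against the actual DataFrame (not run here), the equivalent check is:
//   import org.apache.spark.sql.functions.{col, trim}
//   val bad = df.filter(col(primaryKey).isNull || trim(col(primaryKey)) === "").count()
//   require(bad == 0, s"$bad rows have a null/empty record key")
```

Running this check before each write would distinguish "the data really contains null keys" from "Hudi/Spark misread a valid column", which matters for the intermittent failures described below.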
Expected behavior
Data is overwritten when the second step finishes, with no logical duplicates in the table.
Environment Description
- Hudi version : 0.9 ("org.apache.hudi" %% "hudi-spark3-bundle" % "0.9.0"), self-packaged in a fat jar with the Spark app
- Spark version : 3.1.2 (EMR)
- Hive version : AWS Glue
- Hadoop version : 3.2.1 (EMR)
- Storage (HDFS/S3/GCS…) : S3
- Running on Docker? (yes/no) : no
Additional context
- The issue is intermittent. Sometimes it works fine, and the Spark job successfully overwrites the table with bulk_insert.
- I write several tables in parallel from the same JVM (Scala Spark code). Perhaps that leads to a Hudi / Spark thread-safety issue?
Stacktrace
21/10/07 12:03:18 INFO YarnScheduler: Killing all running tasks in stage 17: Stage cancelled
21/10/07 12:03:18 INFO DAGScheduler: ResultStage 17 (save at HoodieSparkSqlWriter.scala:463) failed in 3.282 s due to Job aborted due to stage failure: Task 0 in stage 17.0 failed 4 times, most recent failure: Lost task 0.3 in stage 17.0 (TID 32) (ip-10-100-160-252.....local executor 1): org.apache.spark.SparkException: Failed to execute user defined function(UDFRegistration$$Lambda$2098/1888531409: (struct<here comes my table schema in struct format.... it has many columns and they have different logical types>) => string)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:42)
at org.apache.spark.RangePartitioner$.$anonfun$sketch$1(Partitioner.scala:306)
at org.apache.spark.RangePartitioner$.$anonfun$sketch$1$adapted(Partitioner.scala:304)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:915)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:915)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hudi.exception.HoodieKeyException: recordKey value: "null" for field: "my_primary_key_column_here" cannot be null or empty.
at org.apache.hudi.keygen.KeyGenUtils.getRecordKey(KeyGenUtils.java:141)
at org.apache.hudi.keygen.NonpartitionedAvroKeyGenerator.getRecordKey(NonpartitionedAvroKeyGenerator.java:60)
at org.apache.hudi.keygen.NonpartitionedKeyGenerator.getRecordKey(NonpartitionedKeyGenerator.java:50)
at org.apache.hudi.keygen.BaseKeyGenerator.getKey(BaseKeyGenerator.java:62)
at org.apache.hudi.keygen.BuiltinKeyGenerator.getRecordKey(BuiltinKeyGenerator.java:75)
at org.apache.spark.sql.UDFRegistration.$anonfun$register$352(UDFRegistration.scala:777)
... 22 more
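The bottom frame shows where the write dies: KeyGenUtils.getRecordKey rejects a null or empty key value. A simplified standalone model of that guard (our names and signatures, not Hudi's actual source) makes the failure mode concrete:

```scala
// Simplified model of the guard in org.apache.hudi.keygen.KeyGenUtils:
// a null or empty record-key value aborts key generation with an exception.
final case class HoodieKeyException(msg: String) extends RuntimeException(msg)

def getRecordKey(field: String, value: String): String =
  if (value == null || value.isEmpty)
    throw HoodieKeyException(
      s"""recordKey value: "null" for field: "$field" cannot be null or empty.""")
  else value
```

Note that the guard fires per record, inside the UDF Spark registered for key generation — which is why the error surfaces as a task failure deep in a stage rather than at driver-side validation time.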
Issue Analytics
- Created 2 years ago
- Comments: 16 (8 by maintainers)
Top GitHub Comments
@novakov-alexey @Carl-Zhou-CN Thanks for making the fix and the test. Such great collaboration!
As a JIRA was filed for this, closing this issue: https://issues.apache.org/jira/browse/HUDI-2582
When using ComplexKeyGenerator, the primary key columns, partition key columns, and precombine key column are case-sensitive, so the correct case must be used in the config file or the Hudi write options.
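For example, if the schema declares the columns as `UserId`, `EventId`, `EventTimestamp`, and `Region` (hypothetical names chosen here for illustration), the write options must use exactly that casing:

```scala
// Hypothetical column names; the point is that the casing in the Hudi
// options must match the schema exactly, or key lookup yields null.
val complexKeyOptions = Map(
  "hoodie.datasource.write.keygenerator.class" ->
    "org.apache.hudi.keygen.ComplexKeyGenerator",
  "hoodie.datasource.write.recordkey.field"     -> "UserId,EventId", // not "userid,eventid"
  "hoodie.datasource.write.partitionpath.field" -> "Region",
  "hoodie.datasource.write.precombine.field"    -> "EventTimestamp"
)
```

A mismatched case (e.g. `userid`) would make the key generator resolve the field to null and raise the same HoodieKeyException discussed in this issue.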