
[SUPPORT] HoodieKeyException: recordKey value: "null"

See original GitHub issue


Describe the problem you faced

A write operation using bulk_insert fails when writing to a non-empty Hudi table. It does not fail if the table is empty.

To Reproduce

Steps to reproduce the behavior:

  1. Create a new table and write some data with the bulk_insert option.
  2. Write the same data batch to this table with the bulk_insert option.

Hudi settings:
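(The option constants below are presumably imported from org.apache.hudi.DataSourceWriteOptions and org.apache.hudi.config.HoodieWriteConfig, which the original snippet does not show; NonPartitionedExtractor and NonpartitionedKeyGenerator come from org.apache.hudi.hive and org.apache.hudi.keygen respectively, and EventTimestampColumn is a project-specific constant defined elsewhere.)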

val unpartitionDataConfig = Map(
  HIVE_PARTITION_EXTRACTOR_CLASS.key -> classOf[NonPartitionedExtractor].getName,
  KEYGENERATOR_CLASS_NAME.key -> classOf[NonpartitionedKeyGenerator].getName
)

private def options(
    table: String,
    primaryKey: String,
    database: String,
    operation: String
): Map[String, String] =
  Map(
    OPERATION.key -> operation,
    PRECOMBINE_FIELD.key -> EventTimestampColumn,
    RECORDKEY_FIELD.key -> primaryKey,
    TABLE_TYPE.key -> COW_TABLE_TYPE_OPT_VAL,
    TBL_NAME.key -> table,
    "hoodie.consistency.check.enabled" -> "true",
    HIVE_SYNC_MODE.key -> "jdbc",
    HIVE_SYNC_ENABLED.key -> "true",
    HIVE_SUPPORT_TIMESTAMP_TYPE.key -> "true",
    HIVE_DATABASE.key -> database,
    HIVE_TABLE.key -> table,
    UPSERT_PARALLELISM_VALUE.key -> "4",
    DELETE_PARALLELISM_VALUE.key -> "4",
    BULKINSERT_PARALLELISM_VALUE.key -> "4"
  ) ++ unpartitionDataConfig

def writerOptions(
    table: String,
    primaryKey: String,
    database: String
): Map[String, String] = {
  val operation = BULK_INSERT_OPERATION_OPT_VAL
  options(table, primaryKey, database, operation) ++ unpartitionDataConfig
}

Spark main code:

val options = writerOptions(tableName, primaryKey, database)

session.read.format("parquet")
  .load(inputPath)
  .write
  .format("hudi")
  .options(options)
  .mode(SaveMode.Overwrite)
  .save(targetPath)
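A quick way to rule out bad input (a hypothetical sanity check, not part of the original job) is to count rows whose record key column is null before the write; a non-zero count would explain the HoodieKeyException shown in the stacktrace below:

// Hypothetical pre-write check: Hudi record keys must not be null or empty.
val input = session.read.format("parquet").load(inputPath)
val nullKeys = input.filter(input(primaryKey).isNull).count()
println(s"Rows with a null record key: $nullKeys")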

Expected behavior

Data is overwritten when the second step finishes, leaving no logical duplicates in the table (a check like the sketch below should pass).
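The "no logical duplicates" expectation could be expressed as a small post-write assertion (a sketch reusing the names from the snippets above; the Hudi read path shown here is an assumption):

// Hypothetical check after the second write: every record key appears exactly once.
val written = session.read.format("hudi").load(targetPath)
assert(written.count() == written.select(primaryKey).distinct().count())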

Environment Description

  • Hudi version : 0.9 ("org.apache.hudi" %% "hudi-spark3-bundle" % "0.9.0"), self-packaged in a fat JAR with the Spark app.

  • Spark version : 3.1.2 (EMR)

  • Hive version : AWS Glue

  • Hadoop version : Hadoop 3.2.1 (EMR)

  • Storage (HDFS/S3/GCS…) : S3

  • Running on Docker? (yes/no) : no

Additional context

  1. This is an intermittent issue: sometimes everything works fine and the Spark job successfully overwrites the table with bulk_insert.
  2. The job uses JVM concurrency in Scala Spark code, writing several tables in parallel, as sketched below. Perhaps that triggers a Hudi / Spark thread-safety issue?
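The parallel-write pattern from point 2 looks roughly like the sketch below (hypothetical table names and base paths; the real job may orchestrate this differently):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Hypothetical example: several Hudi tables written concurrently from one Spark session.
val tables = Seq("table_a", "table_b", "table_c") // placeholder table names
val writes = tables.map { t =>
  Future {
    session.read.format("parquet").load(s"$inputBasePath/$t") // inputBasePath is hypothetical
      .write.format("hudi")
      .options(writerOptions(t, primaryKey, database))
      .mode(SaveMode.Overwrite)
      .save(s"$targetBasePath/$t") // targetBasePath is hypothetical
  }
}
Await.result(Future.sequence(writes), Duration.Inf)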

Stacktrace

21/10/07 12:03:18 INFO YarnScheduler: Killing all running tasks in stage 17: Stage cancelled
21/10/07 12:03:18 INFO DAGScheduler: ResultStage 17 (save at HoodieSparkSqlWriter.scala:463) failed in 3.282 s due to Job aborted due to stage failure: Task 0 in stage 17.0 failed 4 times, most recent failure: Lost task 0.3 in stage 17.0 (TID 32) (ip-10-100-160-252.....local executor 1): org.apache.spark.SparkException: Failed to execute user defined function(UDFRegistration$$Lambda$2098/1888531409: (struct<here comes my table schema in struct format.... it has many columns and they have different logical types>) => string)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
	at org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:42)
	at org.apache.spark.RangePartitioner$.$anonfun$sketch$1(Partitioner.scala:306)
	at org.apache.spark.RangePartitioner$.$anonfun$sketch$1$adapted(Partitioner.scala:304)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:915)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:915)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hudi.exception.HoodieKeyException: recordKey value: "null" for field: "my_primary_key_column_here" cannot be null or empty.
	at org.apache.hudi.keygen.KeyGenUtils.getRecordKey(KeyGenUtils.java:141)
	at org.apache.hudi.keygen.NonpartitionedAvroKeyGenerator.getRecordKey(NonpartitionedAvroKeyGenerator.java:60)
	at org.apache.hudi.keygen.NonpartitionedKeyGenerator.getRecordKey(NonpartitionedKeyGenerator.java:50)
	at org.apache.hudi.keygen.BaseKeyGenerator.getKey(BaseKeyGenerator.java:62)
	at org.apache.hudi.keygen.BuiltinKeyGenerator.getRecordKey(BuiltinKeyGenerator.java:75)
	at org.apache.spark.sql.UDFRegistration.$anonfun$register$352(UDFRegistration.scala:777)
	... 22 more

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 16 (8 by maintainers)

Top GitHub Comments

1 reaction
xushiyan commented, Oct 21, 2021

@novakov-alexey @Carl-Zhou-CN Thanks for making the fix and the test. Such a great collaboration!

As a JIRA ticket was filed (linked below), closing this issue.

https://issues.apache.org/jira/browse/HUDI-2582

0 reactions
sampath-tripuramallu commented, Feb 21, 2022

When using ComplexKeyGenerator, the primary key columns, partition key columns, and precombine key column are case-sensitive, so the correct case must be used in the config file or the Hudi write options; for example, see the sketch below.
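A hypothetical configuration illustrating that point (placeholder column names, reusing the option constants from the snippets above); the values must match the DataFrame schema's case exactly:

// Hypothetical example: with ComplexKeyGenerator the column names are case-sensitive,
// so "UserId" must match the schema exactly; "userid" would fail the key lookup.
val complexKeyOptions = Map(
  KEYGENERATOR_CLASS_NAME.key -> classOf[ComplexKeyGenerator].getName,
  RECORDKEY_FIELD.key -> "UserId,EventId", // placeholder record key columns
  PARTITIONPATH_FIELD.key -> "EventDate",  // placeholder partition column
  PRECOMBINE_FIELD.key -> "EventTimestamp" // placeholder precombine column
)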


Top Results From Across the Web

  • HoodieKeyException Is Reported When Data Is ... - 华为云
    Is it possible to use a nullable field that contains null records as a primary key when creating a Hudi table? No. HoodieKeyException will...

  • More than 1 column in record key in spark Hudi Job while ...
    Below is the data I am trying to write using apache spark framework. ... HoodieKeyException: recordKey value: "null" for field: "albumId, ...

  • [#HUDI-2307] When using delete_partition with ds should not ...
    Caused by: org.apache.hudi.exception.HoodieKeyException: recordKey value: "null" for field: "uuid" cannot be null or empty.

  • [jira] [Updated] (HUDI-2307) Fix the need for a primary key ...
    HoodieKeyException: recordKey value: "null" for field: "uuid" cannot be null or empty. Caused by: org.apache.hudi.exception.

  • Index or primary key cannot contain a Null value (Error 3058)
    Have questions or feedback about Office VBA or this documentation? Please see Office VBA support and feedback for guidance about the ways you ...
