Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

[SUPPORT] AWSDmsAvroPayload does not work correctly with any version above 0.10.0

See original GitHub issue

Describe the problem you faced

We are getting Full Load + CDC data from an RDBMS using AWS Database Migration Service (DMS) into an S3 bucket. We then use Hudi in a Scala Glue job to merge the files into a correct representation of the current state of the database. DMS adds two columns to the data: Op (with values null, I, U, or D) and ts (the timestamp of the operation). We are not using Hive or Avro.
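For illustration, here is a minimal sketch of the input side in Scala (the bucket path and the id column are hypothetical; only the Op and ts columns and the mappedDF name come from this report, and the DMS target format is assumed to be Parquet):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("dms-to-hudi").getOrCreate()

    // DMS lands the Full Load + CDC output under an S3 prefix (placeholder path);
    // the Glue job reads it into the DataFrame that is later handed to Hudi.
    val mappedDF = spark.read.parquet("s3://dms-target-bucket/some_schema/some_table/")

    // Op is null, "I", "U" or "D"; ts is the operation timestamp that is used
    // as the precombine field in the Hudi configuration below.
    mappedDF.select("Op", "ts", "id").show(5)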

This works fine with Hudi 0.9.0 and Hudi 0.10.0. Once we try to upgrade to Hudi 0.11.0, 0.11.1 or 0.12.0, AWSDmsAvroPayload fails with the following error:

33061 [consumer-thread-1] ERROR org.apache.hudi.io.HoodieWriteHandle  - Error writing record HoodieRecord{key=HoodieKey { recordKey=id:3 partitionPath=}, currentLocation='null', newLocation='null'}
java.util.NoSuchElementException: No value present in Option
        at org.apache.hudi.common.util.Option.get(Option.java:89)
        at org.apache.hudi.common.model.AWSDmsAvroPayload.getInsertValue(AWSDmsAvroPayload.java:72)
        at org.apache.hudi.execution.HoodieLazyInsertIterable$HoodieInsertValueGenResult.<init>(HoodieLazyInsertIterable.java:90)
        at org.apache.hudi.execution.HoodieLazyInsertIterable.lambda$getTransformFunction$0(HoodieLazyInsertIterable.java:103)
        at org.apache.hudi.common.util.queue.BoundedInMemoryQueue.insertRecord(BoundedInMemoryQueue.java:190)
        at org.apache.hudi.common.util.queue.IteratorBasedQueueProducer.produce(IteratorBasedQueueProducer.java:46)
        at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$0(BoundedInMemoryExecutor.java:106)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Removing the PAYLOAD_CLASS_OPT_KEY option from the config keeps the job from failing, but the delete operations are then not applied. No other payload class seems to work with the DMS format.

Steps to reproduce the behavior

Dependencies:

"org.apache.hudi" %% "hudi-spark-bundle" % "2.12-0.12.0"
"org.apache.hudi" %% "hudi-utilities-bundle" % "2.12-0.12.0"

Configuration used:

var hudiOptions = scala.collection.mutable.Map[String, String](
      HoodieWriteConfig.TABLE_NAME -> "hudiTableName",
      HoodieWriteConfig.COMBINE_BEFORE_INSERT.key() -> "true",
      DataSourceWriteOptions.OPERATION_OPT_KEY -> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL,
      DataSourceWriteOptions.TABLE_TYPE_OPT_KEY -> DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL,
      DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "primaryKeyField",
      DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY ->  "ts",
      DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY -> classOf[AWSDmsAvroPayload].getName,
      DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY -> classOf[CustomKeyGenerator].getName,
      DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> ""
    )
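For completeness, the configuration above assumes imports roughly along these lines (package locations are taken from the stack trace and Hudi's usual layout; adjust the key generator import if CustomKeyGenerator here refers to a different class):

    import org.apache.hudi.DataSourceWriteOptions
    import org.apache.hudi.common.model.AWSDmsAvroPayload
    import org.apache.hudi.config.{HoodieIndexConfig, HoodieWriteConfig}
    import org.apache.hudi.keygen.CustomKeyGenerator
    import org.apache.spark.sql.SaveMode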

The following options are added if a partition key is defined:

      hudiOptions.put(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "partitionKeyField")
      hudiOptions.put(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true")
      hudiOptions.put(HoodieIndexConfig.INDEX_TYPE.key(), "GLOBAL_BLOOM")
      hudiOptions.put(HoodieIndexConfig.BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE.key(), "true")
      hudiOptions.put(DataSourceWriteOptions.DROP_PARTITION_COLUMNS.key(), "true")

The DataFrame is then written to the target path:

    // Write the DataFrame as a Hudi dataset
    mappedDF
      .dropDuplicates()
      .write
      .format("org.apache.hudi")
      .options(hudiOptions)
      .mode(SaveMode.Append)
      .save("targetDirectory")

Expected behavior

Data obtained by reading the table through Hudi reflects the data present in the source DB.
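As a quick sanity check, a sketch (targetDirectory matches the placeholder path in the write snippet, and id = 3 is used only because it appears in the error message, not because it is known to be a delete):

    // Snapshot-read the Hudi table and compare it against the source database.
    val snapshotDF = spark.read.format("hudi").load("targetDirectory")

    // Any key whose latest DMS operation was a delete (Op = "D") should no
    // longer show up in the snapshot.
    snapshotDF.filter("id = 3").show()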

Environment Description

  • Hudi version : 0.12.0
  • Spark version : 3.1.1
  • Scala version: 2.12.15
  • AWS Glue version : 3.0.0

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 8 (7 by maintainers)

Top GitHub Comments

1 reaction
nsivabalan commented, Sep 12, 2022

Closing this as we have a fix. Thanks for reporting.

1 reaction
yihua commented, Sep 8, 2022

@rahil-c and I discussed this today. The proper fix is to call the corresponding API instead of repeating the invocation of handleDeleteOperation:

Fixed:

  @Override
  public Option<IndexedRecord> getInsertValue(Schema schema, Properties properties) throws IOException {
    return getInsertValue(schema);
  }

  @Override
  public Option<IndexedRecord> getInsertValue(Schema schema) throws IOException {
    IndexedRecord insertValue = super.getInsertValue(schema).get();
    return handleDeleteOperation(insertValue);
  }

  @Override
  public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema, Properties properties)
      throws IOException {
    return combineAndGetUpdateValue(currentValue, schema);
  }

  @Override
  public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema)
      throws IOException {
    IndexedRecord insertValue = super.getInsertValue(schema).get();
    return handleDeleteOperation(insertValue);
  }

@rahil-c will put up a fix.
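Until a release containing that fix is available, one possible stop-gap (a sketch only, not an officially recommended workaround) is to ship a small payload subclass that applies the same delegation and point PAYLOAD_CLASS_OPT_KEY at it. This assumes the single-argument overloads in the shipped 0.11.x/0.12.0 class still handle the Op field correctly, which is what the fix above relies on:

    import java.util.Properties

    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericRecord, IndexedRecord}
    import org.apache.hudi.common.model.AWSDmsAvroPayload
    import org.apache.hudi.common.util.{Option => HOption}

    // Hypothetical subclass: delegate the Properties overloads back to the
    // single-argument methods, mirroring the fix sketched above. The
    // (GenericRecord, Comparable) constructor is the one Hudi uses when it
    // instantiates payload classes reflectively.
    class PatchedAWSDmsAvroPayload(record: GenericRecord, orderingVal: Comparable[_])
        extends AWSDmsAvroPayload(record, orderingVal) {

      override def getInsertValue(schema: Schema, properties: Properties): HOption[IndexedRecord] =
        getInsertValue(schema)

      override def combineAndGetUpdateValue(currentValue: IndexedRecord, schema: Schema,
                                            properties: Properties): HOption[IndexedRecord] =
        combineAndGetUpdateValue(currentValue, schema)
    }

The write configuration would then use DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY -> classOf[PatchedAWSDmsAvroPayload].getName instead of the built-in class. Upgrading to a Hudi release that includes the actual fix remains the cleaner option.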

Read more comments on GitHub >

Top Results From Across the Web

GI Tracker Board - GitHub
[SUPPORT] When writing data with 'hoodie.datasource.write.payload.class' = 'org.apache.hudi.payload.AWSDmsAvroPayload' the 'Op' column is written to hudi, ...
Read more >
Change Capture Using AWS Database Migration Service and ...
We use a special payload class - AWSDMSAvroPayload , to handle the different change operations correctly. The parquet files generated have an Op ......
Read more >
New features from Apache Hudi 0.9.0 on Amazon EMR
Spark SQL DML and DDL support. The most exciting new feature is that Apache Hudi 0.9.0 adds support for DDL/DMLs using Spark...
Read more >
Tag Archives: Best practices - Noise
With over 200 AWS services, most customer workloads can run in the AWS Regions. However, for some location-sensitive workloads with low-latency or data ......
Read more >
