Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

[SUPPORT] AWSDmsAvroPayload does not work correctly with any version above 0.10.0

See original GitHub issue

Describe the problem you faced

We are getting Full Load + CDC data from an RDBMS using AWS Database Migration Service (DMS) into an S3 bucket. We then use Hudi in a Scala Glue job to merge the files into a correct representation of the current state of the database. DMS adds two columns to the data: Op (with values null, I, U, or D) and ts (the timestamp of the operation). We are not using Hive or Avro.
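For illustration, here is a minimal sketch of the input side in Scala (the bucket path and the id column are hypothetical; only the Op and ts columns and the mappedDF name come from this report, and the DMS target format is assumed to be Parquet):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("dms-to-hudi").getOrCreate()

    // DMS lands the Full Load + CDC output under an S3 prefix (placeholder path);
    // the Glue job reads it into the DataFrame that is later handed to Hudi.
    val mappedDF = spark.read.parquet("s3://dms-target-bucket/some_schema/some_table/")

    // Op is null, "I", "U" or "D"; ts is the operation timestamp that is used
    // as the precombine field in the Hudi configuration below.
    mappedDF.select("Op", "ts", "id").show(5)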

This works fine with Hudi 0.9.0 and Hudi 0.10.0. Once we try to upgrade to Hudi 0.11.0, 0.11.1 or 0.12.0, AWSDmsAvroPayload fails with the following error:

33061 [consumer-thread-1] ERROR org.apache.hudi.io.HoodieWriteHandle  - Error writing record HoodieRecord{key=HoodieKey { recordKey=id:3 partitionPath=}, currentLocation='null', newLocation='null'}
java.util.NoSuchElementException: No value present in Option
        at org.apache.hudi.common.util.Option.get(Option.java:89)
        at org.apache.hudi.common.model.AWSDmsAvroPayload.getInsertValue(AWSDmsAvroPayload.java:72)
        at org.apache.hudi.execution.HoodieLazyInsertIterable$HoodieInsertValueGenResult.<init>(HoodieLazyInsertIterable.java:90)
        at org.apache.hudi.execution.HoodieLazyInsertIterable.lambda$getTransformFunction$0(HoodieLazyInsertIterable.java:103)
        at org.apache.hudi.common.util.queue.BoundedInMemoryQueue.insertRecord(BoundedInMemoryQueue.java:190)
        at org.apache.hudi.common.util.queue.IteratorBasedQueueProducer.produce(IteratorBasedQueueProducer.java:46)
        at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$0(BoundedInMemoryExecutor.java:106)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Removing the PAYLOAD_CLASS_OPT_KEY option from the config keeps the job from failing, but the delete operations are then not applied. No other payload class seems to work with the DMS format.

Steps to reproduce the behavior

Dependencies:

"org.apache.hudi" %% "hudi-spark-bundle" % "2.12-0.12.0"
"org.apache.hudi" %% "hudi-utilities-bundle" % "2.12-0.12.0"

Configuration used:

var hudiOptions = scala.collection.mutable.Map[String, String](
      HoodieWriteConfig.TABLE_NAME -> "hudiTableName",
      HoodieWriteConfig.COMBINE_BEFORE_INSERT.key() -> "true",
      DataSourceWriteOptions.OPERATION_OPT_KEY -> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL,
      DataSourceWriteOptions.TABLE_TYPE_OPT_KEY -> DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL,
      DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "primaryKeyField",
      DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY ->  "ts",
      DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY -> classOf[AWSDmsAvroPayload].getName,
      DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY -> classOf[CustomKeyGenerator].getName,
      DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> ""
    )
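For completeness, the configuration above assumes imports roughly along these lines (package locations are taken from the stack trace and Hudi's usual layout; adjust the key generator import if CustomKeyGenerator here refers to a different class):

    import org.apache.hudi.DataSourceWriteOptions
    import org.apache.hudi.common.model.AWSDmsAvroPayload
    import org.apache.hudi.config.{HoodieIndexConfig, HoodieWriteConfig}
    import org.apache.hudi.keygen.CustomKeyGenerator
    import org.apache.spark.sql.SaveMode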

The following options are added if a partition key is defined:

      hudiOptions.put(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "partitionKeyField")
      hudiOptions.put(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true")
      hudiOptions.put(HoodieIndexConfig.INDEX_TYPE.key(), "GLOBAL_BLOOM")
      hudiOptions.put(HoodieIndexConfig.BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE.key(), "true")
      hudiOptions.put(DataSourceWriteOptions.DROP_PARTITION_COLUMNS.key(), "true")

The DataFrame is then written to the target path:

    // Write the DataFrame as a Hudi dataset
    mappedDF
      .dropDuplicates()
      .write
      .format("org.apache.hudi")
      .options(hudiOptions)
      .mode(SaveMode.Append)
      .save("targetDirectory")

Expected behavior

Data obtained by reading the table through Hudi reflects the data present in the source DB.
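As a quick sanity check, a sketch (targetDirectory matches the placeholder path in the write snippet, and id = 3 is used only because it appears in the error message, not because it is known to be a delete):

    // Snapshot-read the Hudi table and compare it against the source database.
    val snapshotDF = spark.read.format("hudi").load("targetDirectory")

    // Any key whose latest DMS operation was a delete (Op = "D") should no
    // longer show up in the snapshot.
    snapshotDF.filter("id = 3").show()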

Environment Description

  • Hudi version : 0.12.0
  • Spark version : 3.1.1
  • Scala version: 2.12.15
  • AWS Glue version : 3.0.0

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 8 (7 by maintainers)

Top GitHub Comments

1 reaction
nsivabalan commented, Sep 12, 2022

Closing this as we have a fix. Thanks for reporting.

1 reaction
yihua commented, Sep 8, 2022

@rahil-c and I discussed this today. The proper fix is to call the corresponding API instead of repeating the invocation of handleDeleteOperation:

Fixed:

  @Override
  public Option<IndexedRecord> getInsertValue(Schema schema, Properties properties) throws IOException {
    return getInsertValue(schema);
  }

  @Override
  public Option<IndexedRecord> getInsertValue(Schema schema) throws IOException {
    IndexedRecord insertValue = super.getInsertValue(schema).get();
    return handleDeleteOperation(insertValue);
  }

  @Override
  public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema, Properties properties)
      throws IOException {
    return combineAndGetUpdateValue(currentValue, schema);
  }

  @Override
  public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema)
      throws IOException {
    IndexedRecord insertValue = super.getInsertValue(schema).get();
    return handleDeleteOperation(insertValue);
  }

@rahil-c will put up a fix.
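Until a release containing that fix is available, one possible stop-gap (a sketch only, not an officially recommended workaround) is to ship a small payload subclass that applies the same delegation and point PAYLOAD_CLASS_OPT_KEY at it. This assumes the single-argument overloads in the shipped 0.11.x/0.12.0 class still handle the Op field correctly, which is what the fix above relies on:

    import java.util.Properties

    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericRecord, IndexedRecord}
    import org.apache.hudi.common.model.AWSDmsAvroPayload
    import org.apache.hudi.common.util.{Option => HOption}

    // Hypothetical subclass: delegate the Properties overloads back to the
    // single-argument methods, mirroring the fix sketched above. The
    // (GenericRecord, Comparable) constructor is the one Hudi uses when it
    // instantiates payload classes reflectively.
    class PatchedAWSDmsAvroPayload(record: GenericRecord, orderingVal: Comparable[_])
        extends AWSDmsAvroPayload(record, orderingVal) {

      override def getInsertValue(schema: Schema, properties: Properties): HOption[IndexedRecord] =
        getInsertValue(schema)

      override def combineAndGetUpdateValue(currentValue: IndexedRecord, schema: Schema,
                                            properties: Properties): HOption[IndexedRecord] =
        combineAndGetUpdateValue(currentValue, schema)
    }

The write configuration would then use DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY -> classOf[PatchedAWSDmsAvroPayload].getName instead of the built-in class. Upgrading to a Hudi release that includes the actual fix remains the cleaner option.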

Read more comments on GitHub >

Top Results From Across the Web

GI Tracker Board - GitHub
[SUPPORT] When writing data with 'hoodie.datasource.write.payload.class' = 'org.apache.hudi.payload.AWSDmsAvroPayload' the 'Op' column is written to hudi, ...
Read more >
Change Capture Using AWS Database Migration Service and ...
We use a special payload class - AWSDMSAvroPayload , to handle the different change operations correctly. The parquet files generated have an Op ......
Read more >
New features from Apache Hudi 0.9.0 on Amazon EMR
Spark SQL DML and DDL support. The most exciting new feature is that Apache Hudi 0.9.0 adds support for DDL/DMLs using Spark...
Read more >
Tag Archives: Best practices - Noise
With over 200 AWS services, most customer workloads can run in the AWS Regions. However, for some location-sensitive workloads with low-latency or data ......
Read more >
