[SUPPORT] Upsert overwriting ordering field with invalid value
Describe the problem you faced
I’m writing an application that upserts records into a table. The problem is that when an upsert is performed, the ordering (precombine) column of records that exist in the base table but are absent from the incoming data is overwritten with an invalid value.
E.g.:
- The base table has a record with id = 1 and createddate = 2022-04-01.
- The incoming data has a record with id = 2 and createddate = 2022-04-02.
- After the upsert, the createddate of the record with id = 1 is changed to 1970-xx-xx, while the record with id = 2 remains intact.
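The expected behavior above can be sketched with plain Python dicts (a hypothetical model that ignores Hudi internals): an upsert keyed on id should leave untouched any base record whose key is absent from the incoming batch.

```python
# Base table and incoming batch, keyed by record key (id)
base = {1: {"id": 1, "createddate": "2022-04-01"}}
incoming = {2: {"id": 2, "createddate": "2022-04-02"}}

# An upsert should only touch keys present in the incoming batch
merged = {**base, **incoming}

assert merged[1]["createddate"] == "2022-04-01"  # expected: untouched
assert merged[2]["createddate"] == "2022-04-02"  # expected: inserted
```

In the reported bug, the record for id = 1 instead comes back with a 1970-xx-xx createddate.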
To Reproduce
from pyspark.sql.functions import expr
from pyspark.sql import DataFrame, SparkSession

database = 'db'
table = 'tb'
table_path = f'/{database}/{table}'

spark = SparkSession.builder.config(
    'spark.sql.shuffle.partitions', '4').enableHiveSupport().getOrCreate()

options = {
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator',
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.partitionpath.field': 'field:simple',
    'hoodie.datasource.write.precombine.field': 'createddate',
    'hoodie.payload.event.time.field': 'createddate',
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.table.name': table,
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.mode': 'hms',
    'hoodie.datasource.hive_sync.support_timestamp': 'true',
    'hoodie.datasource.hive_sync.database': database,
    'hoodie.datasource.hive_sync.table': table,
    'hoodie.datasource.hive_sync.partition_fields': 'field',
}

full = spark.read.parquet('/opt/spark/conf/full/')
delta = spark.read.json('/opt/spark/conf/delta')

# Truncate to 'yyyy-MM-dd HH:mm:ss' and cast to timestamp
full_parse: DataFrame = full \
    .withColumn('createddate', expr('cast(substr(createddate, 1, 19) as timestamp)'))
delta_parse: DataFrame = delta \
    .withColumn('createddate', expr('cast(substr(createddate, 1, 19) as timestamp)'))

# Initial load
full_parse \
    .write \
    .format('org.apache.hudi') \
    .options(**options) \
    .option('hoodie.datasource.write.operation', 'bulk_insert') \
    .mode('overwrite') \
    .save(table_path)

# Incremental upsert that triggers the bug
delta_parse \
    .write \
    .format('org.apache.hudi') \
    .options(**options) \
    .option('hoodie.datasource.write.operation', 'upsert') \
    .mode('append') \
    .save(table_path)
Example full file content
+------------------+-------------------+-----------+-------------------+------------------+---------+----------------+--------+------------------+
|createdbyid |createddate |datatype |field |id |isdeleted|newvalue |oldvalue|parentid |
+------------------+-------------------+-----------+-------------------+------------------+---------+----------------+--------+------------------+
|0055G00000808dFQAQ|2022-03-16 16:55:13|DynamicEnum|Status_do_Imovel__c|0175G0000jIvmN7QAJ|false |Visita Cancelada|null |a015G00000kpbM3QAI|
+------------------+-------------------+-----------+-------------------+------------------+---------+----------------+--------+------------------+
After the upsert operation
+------------------+-----------------------+-----------+-------------------+------------------+---------+----------------+----------------------+------------------+
|createdbyid |createddate |datatype |field |id |isdeleted|newvalue |oldvalue |parentid |
+------------------+-----------------------+-----------+-------------------+------------------+---------+----------------+----------------------+------------------+
|0055G00000808dFQAQ|1970-01-20 01:37:29.713|DynamicEnum|Status_do_Imovel__c|0175G0000jIvmN7QAJ|false |Visita Cancelada|null |a015G00000kpbM3QAI|
+------------------+-----------------------+-----------+-------------------+------------------+---------+----------------+----------------------+------------------+
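The corrupted value is suggestive: 2022-03-16 16:55:13 UTC is 1647449713 seconds after the epoch, and 1647449713 milliseconds after the epoch is exactly 1970-01-20 01:37:29.713. This points at a seconds-vs-milliseconds unit mix-up when the precombine timestamp is re-serialized (an observation from the sample row, not a confirmed root cause). A quick check in plain Python, no Spark needed:

```python
from datetime import datetime, timezone

# Original createddate from the base table (assuming UTC)
original = datetime(2022, 3, 16, 16, 55, 13, tzinfo=timezone.utc)
epoch_seconds = int(original.timestamp())
print(epoch_seconds)  # 1647449713

# Re-read those epoch *seconds* as if they were epoch *milliseconds*
corrupted = datetime.fromtimestamp(epoch_seconds / 1000, tz=timezone.utc)
print(corrupted)  # 1970-01-20 01:37:29.713000+00:00
```

The result matches the corrupted value in the table above to the millisecond.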
Note: a random subset of records is affected by this bug; each execution corrupts a different number of records.
1st execution
spark-sql> select count(id) from db.tb where createddate < '1971-01-01';
97801
2nd execution
spark-sql> select count(id) from db.tb where createddate < '1971-01-01';
76356
Environment Description
- Hudi version : 0.10.0
- Spark version : 3.1.2
- Storage (HDFS/S3/GCS…) : Local
- Running on Docker? (yes/no) : Yes
Issue Analytics
- Created a year ago
- Comments: 5 (3 by maintainers)
Top GitHub Comments
@jasondavindev Thanks for confirming. If this is solved by Hudi 0.11.0 release and there is no other ask for this issue, feel free to close it.
@jasondavindev to clarify, do you still see issues with BULK_INSERT in 0.11.0?