
[SUPPORT] Upsert overwriting ordering field with invalid value


Describe the problem you faced

I’m writing an application to upsert records from a table. The problem is that when an upsert operation runs, the ordering column of records that exist in the base table but not in the incoming data is overwritten with an invalid value. E.g., the base table has a record with id = 1 and createddate = 2022-04-01, and the incoming data has a record with id = 2 and createddate = 2022-04-02.

After the upsert operation, the createddate of the record with id = 1 is changed to 1970-xx-xx, while the record with id = 2 remains intact.
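For reference, a minimal sketch of the expected semantics, using hypothetical two-row data mirroring the example above (not from the original report):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-ins for the base table and the incoming batch
base = spark.createDataFrame([(1, '2022-04-01')], ['id', 'createddate'])
incoming = spark.createDataFrame([(2, '2022-04-02')], ['id', 'createddate'])

# Expected after upserting `incoming` into the table holding `base`:
#   id = 1 -> createddate 2022-04-01 (untouched; no matching key in `incoming`)
#   id = 2 -> createddate 2022-04-02 (newly inserted)
# Observed instead: id = 1 comes back with an epoch-era createddate (1970-xx-xx).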

To Reproduce

from pyspark.sql.functions import expr
from pyspark.sql import DataFrame, SparkSession

database = 'db'
table = 'tb'
table_path = f'/{database}/{table}'

spark = SparkSession.builder.config(
    'spark.sql.shuffle.partitions', '4').enableHiveSupport().getOrCreate()

# Hudi write and Hive-sync options. With CustomKeyGenerator,
# 'field:simple' partitions on the column `field` as a SIMPLE key.
options = {
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator',
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.partitionpath.field': 'field:simple',
    'hoodie.datasource.write.precombine.field': 'createddate',
    'hoodie.payload.event.time.field': 'createddate',
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.table.name': table,

    # Hive metastore sync settings
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.mode': 'hms',
    'hoodie.datasource.hive_sync.support_timestamp': 'true',
    'hoodie.datasource.hive_sync.database': database,
    'hoodie.datasource.hive_sync.table': table,
    'hoodie.datasource.hive_sync.partition_fields': 'field',
}

# Full snapshot (parquet) and incremental delta (json)
full = spark.read.parquet('/opt/spark/conf/full/')
delta = spark.read.json('/opt/spark/conf/delta')

# Normalize createddate to a second-precision timestamp
full_parse: DataFrame = full \
    .withColumn('createddate', expr('cast(substr(createddate, 1, 19) as timestamp)'))

delta_parse: DataFrame = delta \
    .withColumn('createddate', expr('cast(substr(createddate, 1, 19) as timestamp)'))

# Initial full load: bulk_insert into a fresh table
full_parse \
    .write \
    .format('org.apache.hudi') \
    .options(**options) \
    .option('hoodie.datasource.write.operation', 'bulk_insert') \
    .mode('overwrite') \
    .save(table_path)

# Apply the incremental data on top via upsert
delta_parse \
    .write \
    .format('org.apache.hudi') \
    .options(**options) \
    .option('hoodie.datasource.write.operation', 'upsert') \
    .mode('append') \
    .save(table_path)
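To observe the corruption from PySpark rather than spark-sql (this mirrors the count queries shown further down, and is not part of the original repro):

# Read the Hudi table back and count records whose ordering field
# collapsed to an epoch-era value
result = spark.read.format('org.apache.hudi').load(table_path)
print(result.filter("createddate < '1971-01-01'").count())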

Example full file content

+------------------+-------------------+-----------+-------------------+------------------+---------+----------------+--------+------------------+
|createdbyid       |createddate        |datatype   |field              |id                |isdeleted|newvalue        |oldvalue|parentid          |
+------------------+-------------------+-----------+-------------------+------------------+---------+----------------+--------+------------------+
|0055G00000808dFQAQ|2022-03-16 16:55:13|DynamicEnum|Status_do_Imovel__c|0175G0000jIvmN7QAJ|false    |Visita Cancelada|null    |a015G00000kpbM3QAI|
+------------------+-------------------+-----------+-------------------+------------------+---------+----------------+--------+------------------+

After the upsert operation

+------------------+-----------------------+-----------+-------------------+------------------+---------+----------------+----------------------+------------------+
|createdbyid       |createddate            |datatype   |field              |id                |isdeleted|newvalue        |oldvalue              |parentid          |
+------------------+-----------------------+-----------+-------------------+------------------+---------+----------------+----------------------+------------------+
|0055G00000808dFQAQ|1970-01-20 01:37:29.713|DynamicEnum|Status_do_Imovel__c|0175G0000jIvmN7QAJ|false    |Visita Cancelada|null                  |a015G00000kpbM3QAI|
+------------------+-----------------------+-----------+-------------------+------------------+---------+----------------+----------------------+------------------+

Note: a random number of records is affected by this bug; each execution corrupts a different number of records. The corrupted value 1970-01-20 01:37:29.713 is exactly the original timestamp’s epoch value in seconds (1647449713, taking 2022-03-16 16:55:13 as UTC) reinterpreted as milliseconds, which suggests a unit mismatch in how the ordering field is handled.

1st execution

spark-sql> select count(id) from db.tb where createddate < '1971-01-01';
97801

2nd execution

spark-sql> select count(id) from db.tb where createddate < '1971-01-01';
76356
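A sketch for pinning down exactly which records a given run corrupted, by joining the Hudi table back to the original full load on the record key (the join and column comparison below are an assumption, not from the report):

hudi_df = spark.read.format('org.apache.hudi').load(table_path)

# Records whose createddate no longer matches the source full load
corrupted = hudi_df.alias('h') \
    .join(full_parse.alias('f'), 'id') \
    .where('h.createddate <> f.createddate')
corrupted.select('id', 'h.createddate', 'f.createddate').show(truncate=False)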

Environment Description

  • Hudi version : 0.10.0

  • Spark version : 3.1.2

  • Storage (HDFS/S3/GCS…) : Local

  • Running on Docker? (yes/no) : Yes

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
yihua commented, May 2, 2022

@jasondavindev Thanks for confirming. If this is solved by the Hudi 0.11.0 release and there is no other ask for this issue, feel free to close it.
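(For reference, one minimal way to point the repro at 0.11.0 from PySpark; the bundle coordinates below are an assumption for Spark 3.1 / Scala 2.12, not something stated in the thread.)

spark = SparkSession.builder \
    .config('spark.jars.packages',
            'org.apache.hudi:hudi-spark3.1-bundle_2.12:0.11.0') \
    .enableHiveSupport() \
    .getOrCreate()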

0 reactions
yihua commented, Jun 29, 2022

@jasondavindev to clarify, do you still see issues with BULK_INSERT in 0.11.0?
