Schema Evolution: Missing column for previous records when new entry does not have the same while upsert.

Hi Team,

We are currently evaluating Hudi for our analytical use cases, and as part of this exercise we are facing a few issues with schema evolution and data loss. The issue we have currently encountered occurs while updating a record. We have inserted a single record with the following schema:

root
 |-- birthDate: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- id: string (nullable = true)
 |-- lastUpdated: string (nullable = true)
 |-- maritalStatus: struct (nullable = true)
 |    |-- coding: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- code: string (nullable = true)
 |    |    |    |-- display: string (nullable = true)
 |    |    |    |-- system: string (nullable = true)
 |    |-- text: string (nullable = true)
 |-- resourceType: string (nullable = true)
 |-- source: string (nullable = true)

Now we insert new data whose schema omits the maritalStatus struct and adds a multipleBirthBoolean field. Our expected behaviour was that, after adding the second entry, the new column “multipleBirthBoolean” would be added to the overall schema, and the previous column “maritalStatus” struct would be retained and be null for the second entry. Instead, the final schema looks like this:

root
 |-- _hoodie_commit_time: string (nullable = true)
 |-- _hoodie_commit_seqno: string (nullable = true)
 |-- _hoodie_record_key: string (nullable = true)
 |-- _hoodie_partition_path: string (nullable = true)
 |-- _hoodie_file_name: string (nullable = true)
 |-- birthDate: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- id: string (nullable = true)
 |-- lastUpdated: string (nullable = true)
 |-- multipleBirthBoolean: boolean (nullable = true)
 |-- resourceType: string (nullable = true)
 |-- source: string (nullable = true)

Basically, when a new entry is added and it is missing a column from the destination schema, the update succeeds and the missing column vanishes from the previous entries. Let us know if we are missing any configuration options. We cannot control the schema, as it is defined by the FHIR standard (https://www.hl7.org/fhir/patient.html#resource). Most of the fields there are optional, so the incoming data from our customers will be missing certain columns.
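To make the expected behaviour concrete, here is a minimal plain-Python sketch of the semantics described above. It is not Hudi code and makes no claim about how Hudi works internally; the helper names `union_columns` and `pad_record` and the sample values are hypothetical, chosen only for illustration.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def union_columns(table_columns: List[str], incoming_columns: List[str]) -> List[str]:
    """Keep every existing column, then append columns seen only in the new batch."""
    merged = list(table_columns)
    for name in incoming_columns:
        if name not in merged:
            merged.append(name)
    return merged


def pad_record(record: Record, columns: List[str]) -> Record:
    """Return a copy of the record with every column present; missing ones become None."""
    return {name: record.get(name) for name in columns}


# Hypothetical sample records: the second one lacks maritalStatus
# and introduces multipleBirthBoolean.
first = {"id": "1", "gender": "male", "maritalStatus": {"text": "Married"}}
second = {"id": "2", "gender": "female", "multipleBirthBoolean": False}

columns = union_columns(list(first), list(second))
rows = [pad_record(first, columns), pad_record(second, columns)]
```

Under this model the old record keeps its maritalStatus value, the new record gets a null maritalStatus, and no column disappears from the table.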

Environment Description

Hudi version : 0.12.0-SNAPSHOT
Spark version : 3.2.1
Hive version :
Hadoop version :
Storage (HDFS/S3/GCS…) : Local
Running on Docker? (yes/no) : no

Thanks for the help.

Issue Analytics

Created a year ago
Comments: 28 (12 by maintainers)

Top GitHub Comments

1 reaction

santoshsb commented, May 23, 2022

Thanks @xiarixiaoyao. Our schema for storing data, as defined by the FHIR standard (https://www.hl7.org/fhir/patient.schema.json.html), seems to be complicated: since most of the fields are optional, the incoming data will always be missing a few elements (nested as well as those at the root level). The missing root element is fixed by the code you provided; we are now thinking about how to handle missing nested fields.
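One way to think about the nested case is to align each incoming record against the full target schema before writing, filling absent nested fields with nulls. The sketch below is plain Python and is only an assumption about a possible approach, not the code referred to in this comment and not a Hudi API. The function `align` and the dict-based schema model are hypothetical, and arrays of structs are left out for brevity.

```python
from typing import Any, Dict, Optional

# Hypothetical schema model: leaf fields map to None,
# struct fields map to a nested dict of their own fields.
Schema = Dict[str, Optional[dict]]


def align(record: Optional[Dict[str, Any]], schema: Schema) -> Optional[Dict[str, Any]]:
    """Return the record reshaped to the schema, with missing fields set to None.

    Fields present in the record but absent from the schema are dropped.
    """
    if record is None:
        return None
    aligned: Dict[str, Any] = {}
    for name, sub_schema in schema.items():
        value = record.get(name)
        if isinstance(sub_schema, dict) and isinstance(value, dict):
            # Recurse into structs so nested gaps are filled as well.
            aligned[name] = align(value, sub_schema)
        else:
            aligned[name] = value
    return aligned


schema = {"id": None, "maritalStatus": {"text": None, "coding": None}}
partial = {"id": "3", "maritalStatus": {"text": "Single"}}  # "coding" is missing
aligned = align(partial, schema)
```

A record that omits the whole struct ends up with a null maritalStatus, while one that omits only a nested field keeps the struct and gets a null for that field.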

0 reactions

codope commented, Sep 8, 2022

Great! Gonna close this issue then. FYI, we also plan to flip the default for schema reconciliation in the next release. See #6196
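For readers who want to try this before the default changes, the following sketch builds a set of write options with reconciliation switched on. It assumes that "schema reconciliation" here refers to Hudi's `hoodie.datasource.write.reconcile.schema` write option; the helper `hudi_upsert_options`, the table name "patient", and the choice of `id` as record key and `lastUpdated` as precombine field are hypothetical, picked to match the schema in this issue.

```python
from typing import Dict


def hudi_upsert_options(table_name: str, record_key: str, precombine_field: str) -> Dict[str, str]:
    """Build Hudi write options for an upsert with schema reconciliation enabled."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        # Assumption: this is the option meant by "schema reconciliation".
        "hoodie.datasource.write.reconcile.schema": "true",
    }


# Hypothetical values matching the Patient schema discussed in this issue.
options = hudi_upsert_options("patient", "id", "lastUpdated")
```

With PySpark, such a dict would typically be passed along the lines of `df.write.format("hudi").options(**options).mode("append").save(path)`; whether this also covers missing nested fields is not confirmed by this thread.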
