
Schema Evolution: Column missing from previous records when a new entry lacks it during upsert


Hi Team,

We are currently evaluating Hudi for our analytical use cases, and as part of this exercise we are facing a few issues with schema evolution and data loss. The issue we have run into occurs while updating a record. We first inserted a single record with the following schema:

root
 |-- birthDate: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- id: string (nullable = true)
 |-- lastUpdated: string (nullable = true)
 |-- maritalStatus: struct (nullable = true)
 |    |-- coding: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- code: string (nullable = true)
 |    |    |    |-- display: string (nullable = true)
 |    |    |    |-- system: string (nullable = true)
 |    |-- text: string (nullable = true)
 |-- resourceType: string (nullable = true)
 |-- source: string (nullable = true)

Now, when we insert new data with the following schema:

root
 |-- birthDate: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- id: string (nullable = true)
 |-- lastUpdated: string (nullable = true)
 |-- multipleBirthBoolean: boolean (nullable = true)
 |-- resourceType: string (nullable = true)
 |-- source: string (nullable = true)

The update is successful, but the schema is missing the following field:

 |-- maritalStatus: struct (nullable = true)
 |    |-- coding: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- code: string (nullable = true)
 |    |    |    |-- display: string (nullable = true)
 |    |    |    |-- system: string (nullable = true)
 |    |-- text: string (nullable = true)

Our expected behaviour was that after adding the second entry, the new column “multipleBirthBoolean” would be added to the overall schema, and the previous column “maritalStatus” struct would be retained and be null for the second entry. Instead, the final schema looks like this:

root
 |-- _hoodie_commit_time: string (nullable = true)
 |-- _hoodie_commit_seqno: string (nullable = true)
 |-- _hoodie_record_key: string (nullable = true)
 |-- _hoodie_partition_path: string (nullable = true)
 |-- _hoodie_file_name: string (nullable = true)
 |-- birthDate: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- id: string (nullable = true)
 |-- lastUpdated: string (nullable = true)
 |-- multipleBirthBoolean: boolean (nullable = true)
 |-- resourceType: string (nullable = true)
 |-- source: string (nullable = true)
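To make the expected behaviour concrete, here is a minimal sketch in plain Python (no Spark or Hudi involved). The helpers `merge_schemas` and `pad_record` are hypothetical names, not Hudi APIs, and schemas are modelled as simple name-to-type dicts purely for illustration. The sample record values are made up.

```python
# Hypothetical helpers, NOT part of Hudi. Schemas are modelled as ordered
# {column name: type string} dicts purely for illustration.

def merge_schemas(table_schema, incoming_schema):
    """Return the union of both schemas, keeping existing table columns first."""
    merged = dict(table_schema)  # existing columns are always retained
    for name, dtype in incoming_schema.items():
        if name in merged and merged[name] != dtype:
            raise ValueError(f"type conflict on column {name!r}")
        merged.setdefault(name, dtype)  # append columns that are new to the table
    return merged


def pad_record(record, schema):
    """Return the record with every schema column present; absent ones become None."""
    return {name: record.get(name) for name in schema}


table_schema = {
    "birthDate": "string",
    "gender": "string",
    "id": "string",
    "lastUpdated": "string",
    "maritalStatus": "struct",
    "resourceType": "string",
    "source": "string",
}
incoming_schema = {
    "birthDate": "string",
    "gender": "string",
    "id": "string",
    "lastUpdated": "string",
    "multipleBirthBoolean": "boolean",
    "resourceType": "string",
    "source": "string",
}

merged = merge_schemas(table_schema, incoming_schema)

# Made-up second entry that has no maritalStatus value.
second_entry = pad_record({"id": "2", "multipleBirthBoolean": True}, merged)
```

Under these assumptions, `merged` keeps “maritalStatus” and gains “multipleBirthBoolean”, and the second entry simply carries a null for “maritalStatus”.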

Basically, when a new entry is added that is missing a column from the destination schema, the update succeeds and the missing column vanishes from the previous entries. Let us know if we are missing any configuration options. We cannot control the schema, as it is defined by FHIR standards (https://www.hl7.org/fhir/patient.html#resource). Most of the fields there are optional, so the incoming data from our customers will be missing certain columns.

Environment Description

  • Hudi version : 0.12.0-SNAPSHOT

  • Spark version : 3.2.1

  • Hive version :

  • Hadoop version :

  • Storage (HDFS/S3/GCS…) : Local

  • Running on Docker? (yes/no) : no

Thanks for the help.

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 28 (12 by maintainers)

Top GitHub Comments

1 reaction
santoshsb commented, May 23, 2022

Thanks @xiarixiaoyao. Our schema for storing data, as defined by the FHIR standards (https://www.hl7.org/fhir/patient.schema.json.html), seems to be complicated: since most of the fields are optional, the incoming data will always be missing a few elements (nested ones as well as those at the root level). The missing root element is fixed by the code you provided; we are still thinking about how to handle the missing nested fields.
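One possible way to approach the nested case on the client side, before the data reaches the writer, is to pad each incoming record against a full target schema. The sketch below is plain Python under stated assumptions: `pad_nested` is a hypothetical helper (not a Hudi or Spark API), the schema is modelled as nested dicts (struct), single-element lists (array), and type strings (leaf), and the sample record is made up.

```python
# Hypothetical helper, NOT a Hudi or Spark API.
# Schema model (an assumption for this sketch):
#   dict -> struct, [element_schema] -> array, str -> primitive leaf.

def pad_nested(value, schema):
    """Recursively add missing struct fields as None so value matches schema.

    Fields present in value but absent from schema are dropped.
    """
    if isinstance(schema, dict):  # struct
        if value is None:
            return None  # a struct that is missing entirely stays null
        return {key: pad_nested(value.get(key), sub) for key, sub in schema.items()}
    if isinstance(schema, list):  # array with a single element schema
        if value is None:
            return None
        return [pad_nested(item, schema[0]) for item in value]
    return value  # primitive leaf: keep as-is (None when missing)


# Reduced version of the schema from the issue.
patient_schema = {
    "id": "string",
    "maritalStatus": {
        "coding": [{"code": "string", "display": "string", "system": "string"}],
        "text": "string",
    },
    "multipleBirthBoolean": "boolean",
}

# Made-up record missing both nested and root-level fields.
record = {"id": "1", "maritalStatus": {"coding": [{"code": "M"}]}}
padded = pad_nested(record, patient_schema)
```

With this approach every record handed to the writer has the same shape, so missing nested elements become explicit nulls rather than schema differences.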

0 reactions
codope commented, Sep 8, 2022

Great! Gonna close this issue then. FYI, we also plan to flip the default for schema reconciliation in the next release. See #6196
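For readers looking for the setting in question, the sketch below assumes that the "schema reconciliation" mentioned above refers to the Hudi write option `hoodie.datasource.write.reconcile.schema`; the thread excerpt here does not spell that out. The table name, the choice of `id` as record key and `lastUpdated` as precombine field, and the `write_upsert` helper are all assumptions for illustration.

```python
# Assumed values throughout: table name and key fields are placeholders
# chosen from the schema in the issue, not confirmed by the thread.
hudi_options = {
    "hoodie.table.name": "patient",  # hypothetical table name
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "lastUpdated",
    "hoodie.datasource.write.operation": "upsert",
    # Assumption: this is the reconciliation switch the comment refers to.
    "hoodie.datasource.write.reconcile.schema": "true",
}


def write_upsert(df, base_path, options=None):
    """Upsert a Spark DataFrame into the Hudi table at base_path.

    Hypothetical wrapper; df is expected to be a pyspark.sql.DataFrame.
    """
    opts = dict(hudi_options if options is None else options)
    df.write.format("hudi").options(**opts).mode("append").save(base_path)
```

Whether this option also covers missing nested fields is not established by the comments above, so it is worth verifying against the Hudi version in use.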


