[SUPPORT] Reconcile schema - missing field dropped from metadata

Describe the problem you faced

I’m using schema on read (the full schema evolution feature) together with the reconcile schema feature to evolve a Hudi table schema. The table is a COW table and is synchronized with the Glue Data Catalog.
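The report does not include the writer configuration. As a rough sketch only, the two features mentioned above are normally switched on through Hudi write options such as the ones below. The table name, record key, and precombine field are hypothetical placeholders and do not come from the issue.

```python
# Hedged sketch of Hudi writer options for the setup described above.
# Values marked "assumed" are illustrative and not taken from the issue.
hudi_options = {
    "hoodie.table.name": "example_table",                 # assumed name
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "col_1",   # assumed record key
    "hoodie.datasource.write.precombine.field": "col_2",  # assumed precombine field
    # The two features named in the report:
    "hoodie.datasource.write.reconcile.schema": "true",
    "hoodie.schema.on.read.enable": "true",
    # Catalog sync; on EMR this typically targets the Glue Data Catalog.
    "hoodie.datasource.hive_sync.enable": "true",
}
```

With the DataFrame API, a dictionary like this would typically be passed as `df.write.format("hudi").options(**hudi_options).mode("append").save(path)`, where `path` is the table's S3 location.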

I add a column (col_a) in the middle of the table in one batch (upsert operation). In the next batch (also an upsert) I add a new column (col_b) at the end of the table, but col_a is missing from the DataFrame. When I then query the table via Athena or Spark SQL, col_a has been dropped and is not visible.

I can then upsert another batch with a DataFrame that contains both col_a and col_b, after which all data is visible in Spark and Athena.
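One way to apply that workaround up front is to find the table columns that the incoming batch lacks and add them before writing. The helper below is a hypothetical, pure-Python sketch of that check; it is not part of Hudi, and schemas are modelled as simple lists of (name, type) pairs.

```python
def columns_to_pad(table_schema, batch_schema):
    """Return (name, type) pairs that exist in the table but not in the batch."""
    incoming = {name for name, _ in batch_schema}
    return [(name, dtype) for name, dtype in table_schema if name not in incoming]


# Schemas from the scenario: the table after batch 1, and the DataFrame of batch 2.
table_schema = [("col_1", "string"), ("col_a", "string"), ("col_2", "string")]
batch_schema = [("col_1", "string"), ("col_2", "string"), ("col_b", "string")]

missing = columns_to_pad(table_schema, batch_schema)
print(missing)  # [('col_a', 'string')]
```

In PySpark, each missing column could then be added as a typed null, for example with `df.withColumn(name, F.lit(None).cast(dtype))`, so that the batch carries both col_a and col_b.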

I would expect that during the schema reconciliation phase Hudi would handle this case and preserve col_a with a null value.

To Reproduce

Steps to reproduce the behavior:

Edit: I used the DataFrame API to upsert data into the Hudi table.

Operations, step by step

| Batch seq | Operation | DF schema | Table schema (actual) | Expected table schema |
| --- | --- | --- | --- | --- |
| 0 | insert | col_1: string, col_2: string | col_1: string, col_2: string | col_1: string, col_2: string |
| 1 | upsert | col_1: string, col_a: string, col_2: string | col_1: string, col_a: string, col_2: string | col_1: string, col_a: string, col_2: string |
| 2 | upsert | col_1: string, col_2: string, col_b: string | col_1: string, col_2: string, col_b: string | col_1: string, col_a: string, col_2: string, col_b: string |

Expected behavior

After batch 2 the table should have the schema: col_1: string, col_a: string, col_2: string, col_b: string

col_a should be preserved, with null values in the rows where the column is missing.
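The expectation can be stated as a small merge rule: keep every existing table column in its current position, append columns that are new in the batch, and fill absent values with null. The sketch below illustrates that rule in plain Python. It reflects the behavior requested in this report, not Hudi's actual reconciliation code, and the function names are made up for illustration.

```python
def reconcile_columns(table_cols, batch_cols):
    """Keep all table columns in order, then append columns new in the batch."""
    merged = list(table_cols)
    merged += [c for c in batch_cols if c not in table_cols]
    return merged


def pad_row(row, merged_cols):
    """Project a row onto the merged schema, using None for absent columns."""
    return {c: row.get(c) for c in merged_cols}


table_cols = ["col_1", "col_a", "col_2"]   # table schema after batch 1
batch_cols = ["col_1", "col_2", "col_b"]   # DataFrame schema in batch 2

merged = reconcile_columns(table_cols, batch_cols)
print(merged)  # ['col_1', 'col_a', 'col_2', 'col_b']

row = pad_row({"col_1": "k1", "col_2": "x", "col_b": "y"}, merged)
print(row["col_a"])  # None
```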

Environment Description

  • Hudi version : 0.11.0 OSS

  • Spark version : 3.2.0-amzn

  • Hive version : 3.2.1

  • Hadoop version : 3.2.1

  • Storage (HDFS/S3/GCS…) : S3

  • Running on Docker? (yes/no) : yes (EMR on EKS 6.6)

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

xiarixiaoyao commented, Jun 16, 2022 (1 reaction)

@kazdy ok, Thank you for your answer, let me fix this problem in next few days

codope commented, Jun 17, 2022 (0 reactions)

> @kazdy ok, Thank you for your answer, let me fix this problem in next few days

@xiarixiaoyao Sounds good. Closing this ticket. I’ve made it a blocker for the 0.12 release.
