
[Feature Request] Support data type changes for schema evolution


Feature request

Overview

Currently, only the following schema changes are eligible for schema evolution during table appends or overwrites (see the sketch after this list):

  • Adding new columns
  • Changing data types from NullType to any other type, or upcasting from ByteType -> ShortType -> IntegerType
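
For reference, a minimal sketch of the evolution that already works today: appending a DataFrame that carries one extra column, with mergeSchema enabled. The session setup, path, and values are illustrative and assume the delta-spark pip package:

# Illustrative setup of a local Delta-enabled SparkSession (delta-spark pip package)
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("merge-schema-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

demo_path = "/tmp/delta/schema_evolution_demo"  # illustrative path

# Initial table schema: [id: int]
spark.createDataFrame([(1,), (2,)], "id int").write.format("delta").save(demo_path)

# Appending a DataFrame with an added column succeeds once mergeSchema is enabled
spark.createDataFrame([(3, "c")], "id int, name string") \
    .write.format("delta").mode("append").option("mergeSchema", "true").save(demo_path)

# The table now has both columns; existing rows read back with name = null
spark.read.format("delta").load(demo_path).printSchema()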

Is it possible to support more data type changes during append operations when schema evolution is enabled?

Motivation

For history table updates, we want to keep every changed record as a newly appended record, rather than overwriting the schema, when a data type has changed in the source and schema evolution is enabled.

For example, when the scale of a decimal type is changed from 2 to 4 while the precision is kept unchanged:

// i.e. the data type of one column is changed from decimal(38,2) to decimal(38,4)
df.write.format("delta").option("mergeSchema", "true").mode("append").save(targetPath)

The append fails with: Failed to merge decimal types with incompatible scale 2 and 4
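
For illustration, a rough reproduction of the decimal case might look like the following (it assumes a Delta-enabled spark session like the one in the sketch above; the path, column names, and values are made up):

from decimal import Decimal

targetPath = "/tmp/delta/history_table"  # illustrative path

# Target table written with amount: decimal(38,2)
df_v1 = spark.createDataFrame([(1, Decimal("10.25"))], "id int, amount decimal(38,2)")
df_v1.write.format("delta").save(targetPath)

# The source now delivers amount: decimal(38,4)
df_v2 = spark.createDataFrame([(2, Decimal("10.2575"))], "id int, amount decimal(38,4)")

# Fails today, even with mergeSchema enabled:
# error: Failed to merge decimal types with incompatible scale 2 and 4
df_v2.write.format("delta").option("mergeSchema", "true").mode("append").save(targetPath)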

Can this decimal scale change be supported by Delta schema evolution during table appends? Please review other data type changes as well.

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

  • Yes. I can contribute this feature independently.
  • Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
  • No. I cannot contribute this feature at this time.

Issue Analytics

  • State: open
  • Created: a year ago
  • Reactions: 1
  • Comments: 12 (6 by maintainers)

Top GitHub Comments

2 reactions
zmeir commented, May 17, 2022

We currently don’t support IntegerType -> LongType

Huh… why is that, really? I naively assumed it would work just the same as ByteType -> ShortType -> IntegerType.

It just so happens that this is the exact error that led me to find this issue:

# df schema is [id: bigint]
df.write.format("delta").save(path)
# new_df schema is [id: int, name: string]
new_df.write.format("delta").mode("append").option("mergeSchema", "true").save(path)
# error: Failed to merge fields 'id' and 'id'. Failed to merge incompatible data types LongType and IntegerType

What’s interesting is that the following workaround actually works:

DeltaTable.forPath(spark, path).merge(new_df, "false").whenNotMatchedInsertAll().execute()
# schema after this command: [id: bigint, name: string]
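
For context (an editorial note, not part of the original comment): automatic schema evolution in merge generally requires the spark.databricks.delta.schema.autoMerge.enabled flag, and merge's insert path casts source columns to the target's types, which is presumably why the int -> bigint mismatch stops being fatal here. A hedged sketch of the same workaround with the flag made explicit:

# Sketch of the merge-based workaround; path and DataFrame names reuse the snippet above
from delta.tables import DeltaTable

# Generally required for merge to add new columns such as name
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

# The "false" condition never matches, so every source row is inserted;
# the insert presumably casts the source's int id to the target's bigint
(DeltaTable.forPath(spark, path)
    .alias("t")
    .merge(new_df.alias("s"), "false")
    .whenNotMatchedInsertAll()
    .execute())
# schema after this command: [id: bigint, name: string]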

2 reactions
dennyglee commented, May 6, 2022

Thanks @frankyangdev. As per our Slack conversations, this one may be a little tricky. Let me summarize some key points and add some additional ones, thanks to @zsxwing:

  • Reviewing the Parquet Logical Type Definitions, it seems that decimals are stored as logical data types backed by integers, per Data types in Apache Parquet - thanks @bartosz25
  • When reviewing Spark's supported data types, I was reminded that right now Delta Lake's conversion support is ByteType -> ShortType -> IntegerType; we currently don't support IntegerType -> LongType. This may be coming into play here, in that Spark uses java.math.BigDecimal to support the conversion, but Parquet itself stores things using the various primitives.
  • To add to it, when reviewing ParquetSchemaConverter.scala L589-L611, this may not be a simple change. Decimal may use different underlying types, and when a column has different types in different Parquet files, there may be issues reading it back directly.

That said, please chime in if I'm missing something here or over-complicating things.
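
Until wider type changes are supported, one workaround callers can apply today (a sketch, not something proposed in this thread) is to cast the incoming DataFrame to the table's existing column types before appending, so every Parquet file keeps a single physical type per column:

# Hypothetical helper: align a source DataFrame to a Delta table's current schema
# before appending. Names and paths reuse the illustrative decimal snippet above.
from pyspark.sql import DataFrame
from pyspark.sql.functions import col

def align_to_table(df: DataFrame, table_path: str) -> DataFrame:
    # Cast columns the table already has to the table's types;
    # leave genuinely new columns untouched so mergeSchema can still add them.
    target_types = {f.name: f.dataType
                    for f in spark.read.format("delta").load(table_path).schema.fields}
    return df.select([
        col(c).cast(target_types[c]) if c in target_types else col(c)
        for c in df.columns
    ])

aligned = align_to_table(df_v2, targetPath)
aligned.write.format("delta").mode("append").option("mergeSchema", "true").save(targetPath)

The obvious trade-off is that casting decimal(38,4) back to decimal(38,2) rounds away the extra scale digits, which is exactly what this feature request wants to avoid.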


