
[Feature Request] Support data type changes for schema evolution


Feature request

Overview

Currently, only the following schema changes are eligible for schema evolution during table appends or overwrites (see the sketch after this list):

  • Adding new columns
  • Changing data types from NullType to any other type, or upcasting from ByteType -> ShortType -> IntegerType
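
For reference, a minimal sketch of the evolution that already works today: appending a DataFrame that carries one extra column, with mergeSchema enabled. The session setup, path, and values are illustrative and assume the delta-spark pip package:

# Illustrative setup of a local Delta-enabled SparkSession (delta-spark pip package)
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("merge-schema-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

demo_path = "/tmp/delta/schema_evolution_demo"  # illustrative path

# Initial table schema: [id: int]
spark.createDataFrame([(1,), (2,)], "id int").write.format("delta").save(demo_path)

# Appending a DataFrame with an added column succeeds once mergeSchema is enabled
spark.createDataFrame([(3, "c")], "id int, name string") \
    .write.format("delta").mode("append").option("mergeSchema", "true").save(demo_path)

# The table now has both columns; existing rows read back with name = null
spark.read.format("delta").load(demo_path).printSchema()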

Is it possible to support more data type changes during append operations when schema evolution is enabled?

Motivation

For history table updates, we want to keep every changed record as a newly appended record, rather than overwriting the schema, when a data type has changed in the source and schema evolution is enabled.

For example, when the scale of a decimal type is changed from 2 to 4 while the precision is kept unchanged:

// i.e. the data type of one column is changed from decimal(38,2) to decimal(38,4)
df.write.format("delta").option("mergeSchema", "true").mode("append").save(targetPath)

The append fails with: Failed to merge decimal types with incompatible scale 2 and 4
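
For illustration, a rough reproduction of the decimal case might look like the following (it assumes a Delta-enabled spark session like the one in the sketch above; the path, column names, and values are made up):

from decimal import Decimal

targetPath = "/tmp/delta/history_table"  # illustrative path

# Target table written with amount: decimal(38,2)
df_v1 = spark.createDataFrame([(1, Decimal("10.25"))], "id int, amount decimal(38,2)")
df_v1.write.format("delta").save(targetPath)

# The source now delivers amount: decimal(38,4)
df_v2 = spark.createDataFrame([(2, Decimal("10.2575"))], "id int, amount decimal(38,4)")

# Fails today, even with mergeSchema enabled:
# error: Failed to merge decimal types with incompatible scale 2 and 4
df_v2.write.format("delta").option("mergeSchema", "true").mode("append").save(targetPath)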

Can this decimal scale change be supported by Delta schema evolution during table appends? Please review other data type changes as well.

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

  • Yes. I can contribute this feature independently.
  • Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
  • No. I cannot contribute this feature at this time.

Issue Analytics

  • State: open
  • Created: a year ago
  • Reactions: 1
  • Comments: 12 (6 by maintainers)

Top GitHub Comments

2 reactions
zmeir commented, May 17, 2022

We currently don’t support IntegerType -> LongType

Huh… why is that, really? I naively assumed it would work just the same as ByteType -> ShortType -> IntegerType.

It just so happens that this is the exact error that led me to find this issue:

# df schema is [id: bigint]
df.write.format("delta").save(path)
# new_df schema is [id: int, name: string]
new_df.write.format("delta").mode("append").option("mergeSchema", "true").save(path)
# error: Failed to merge fields 'id' and 'id'. Failed to merge incompatible data types LongType and IntegerType

What’s interesting is that the following workaround actually works:

DeltaTable.forPath(spark, path).merge(new_df, "false").whenNotMatchedInsertAll().execute()
# schema after this command: [id: bigint, name: string]
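
For context (an editorial note, not part of the original comment): automatic schema evolution in merge generally requires the spark.databricks.delta.schema.autoMerge.enabled flag, and merge's insert path casts source columns to the target's types, which is presumably why the int -> bigint mismatch stops being fatal here. A hedged sketch of the same workaround with the flag made explicit:

# Sketch of the merge-based workaround; path and DataFrame names reuse the snippet above
from delta.tables import DeltaTable

# Generally required for merge to add new columns such as name
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

# The "false" condition never matches, so every source row is inserted;
# the insert presumably casts the source's int id to the target's bigint
(DeltaTable.forPath(spark, path)
    .alias("t")
    .merge(new_df.alias("s"), "false")
    .whenNotMatchedInsertAll()
    .execute())
# schema after this command: [id: bigint, name: string]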

2 reactions
dennyglee commented, May 6, 2022

Thanks @frankyangdev. As per our Slack conversations, this one may be a little tricky. Let me summarize some key points and add some additional ones, thanks to @zsxwing:

  • Reviewing the Parquet Logical Type Definitions, it seems that decimals are stored as logical data types backed by integers, per Data types in Apache Parquet - thanks @bartosz25
  • When reviewing Spark's supported data types, I was reminded that right now Delta Lake's conversion support is ByteType -> ShortType -> IntegerType; we currently don't support IntegerType -> LongType. This may be coming into play here, in that Spark uses java.math.BigDecimal to support the conversion, but Parquet itself stores things using the various primitives.
  • To add to it, when reviewing ParquetSchemaConverter.scala L589-L611, this may not be a simple change. Decimal may use different underlying types, and when a column has different types in different Parquet files, there may be issues reading it back directly.

That said, please chime in if I'm missing something here or over-complicating things.
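
Until wider type changes are supported, one workaround callers can apply today (a sketch, not something proposed in this thread) is to cast the incoming DataFrame to the table's existing column types before appending, so every Parquet file keeps a single physical type per column:

# Hypothetical helper: align a source DataFrame to a Delta table's current schema
# before appending. Names and paths reuse the illustrative decimal snippet above.
from pyspark.sql import DataFrame
from pyspark.sql.functions import col

def align_to_table(df: DataFrame, table_path: str) -> DataFrame:
    # Cast columns the table already has to the table's types;
    # leave genuinely new columns untouched so mergeSchema can still add them.
    target_types = {f.name: f.dataType
                    for f in spark.read.format("delta").load(table_path).schema.fields}
    return df.select([
        col(c).cast(target_types[c]) if c in target_types else col(c)
        for c in df.columns
    ])

aligned = align_to_table(df_v2, targetPath)
aligned.write.format("delta").mode("append").option("mergeSchema", "true").save(targetPath)

The obvious trade-off is that casting decimal(38,4) back to decimal(38,2) rounds away the extra scale digits, which is exactly what this feature request wants to avoid.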


