[Feature Request] Support data type changes for schema evolution
Feature request
Overview
Currently, the following schema changes are eligible for schema evolution during table appends or overwrites:
- Adding new columns
- Changing a data type from NullType -> any other type, or upcasting from ByteType -> ShortType -> IntegerType
Is it possible to support additional data type changes during append operations with schema evolution enabled?
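For context, here is a minimal sketch of a change that schema evolution already supports today (assuming a SparkSession named spark with Delta Lake configured; the path and column names are illustrative):

```scala
// Sketch of a currently supported schema change: adding a new column with mergeSchema.
// Assumes a SparkSession `spark` with Delta Lake configured; the path is illustrative.
import org.apache.spark.sql.functions.lit

val targetPath = "/tmp/delta/schema-evolution-demo"

// Initial table with two columns.
spark.range(5).withColumn("status", lit("open"))
  .write.format("delta").mode("overwrite").save(targetPath)

// Appending data with an extra column succeeds because adding new columns is supported.
spark.range(5).withColumn("status", lit("closed")).withColumn("owner", lit("alice"))
  .write.format("delta").option("mergeSchema", "true").mode("append").save(targetPath)
```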
Motivation
For history table updates, we want to keep every changed record as a newly appended record, rather than overwriting the schema, when a data type has changed at the source and schema evolution is enabled.
For example, when the scale of the decimal type is changed from 2 to 4 and the precision is kept unchanged:
//i.e. the data type of one column changes from decimal(38,2) to decimal(38,4)
df.write.format("delta").option("mergeSchema", "true").mode("append").save(targetPath)
The append fails with the error: Failed to merge decimal types with incompatible scale 2 and 4
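A minimal sketch that reproduces the error above (assuming a SparkSession named spark with Delta Lake configured; the path and column name are illustrative):

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

val targetPath = "/tmp/delta/history-table"

// The target table is written with amount as decimal(38,2).
spark.range(3).select(col("id").cast(DecimalType(38, 2)).as("amount"))
  .write.format("delta").mode("overwrite").save(targetPath)

// The source now produces decimal(38,4); this append fails even with mergeSchema enabled:
// "Failed to merge decimal types with incompatible scale 2 and 4"
spark.range(3).select(col("id").cast(DecimalType(38, 4)).as("amount"))
  .write.format("delta").option("mergeSchema", "true").mode("append").save(targetPath)
```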
Can this decimal scale change be supported by Delta schema evolution during table appends? Please consider other data type changes as well.
Willingness to contribute
The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?
- Yes. I can contribute this feature independently.
- Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
- No. I cannot contribute this feature at this time.
Huh… Why is that really? I naively assumed it should be just the same as ByteType -> ShortType -> IntegerType. It just so happens that this is the exact error that led me to find this issue:
What’s interesting is that the following workaround actually works:
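(The commenter's actual workaround was not captured above. One common approach, shown here as an assumption rather than the commenter's code, is to cast the incoming column back to the target table's existing decimal type before appending, so no schema change is needed.)

```scala
// Assumption: one possible workaround, not necessarily the one referenced above.
// Casting the incoming data to the target's existing decimal(38,2) avoids any schema
// change; note that the extra scale digits are rounded away.
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

// Placeholder for incoming data that arrives as decimal(38,4).
val sourceDf = spark.range(3).select(col("id").cast(DecimalType(38, 4)).as("amount"))
val targetPath = "/tmp/delta/history-table"

sourceDf.withColumn("amount", col("amount").cast(DecimalType(38, 2)))
  .write.format("delta").mode("append").save(targetPath)
```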
Thanks @frankyangdev. As per our Slack conversations, this one may be a little tricky. Let me summarize some key points and add some additional points, thanks to @zsxwing:
- We currently support the upcasts ByteType -> ShortType -> IntegerType; we don't currently support IntegerType -> LongType.
- This may be coming into play here, in that Spark uses java.math.BigDecimal to support the conversion, but Parquet itself stores values using various primitives. Decimal may use different underlying types, and when a column has different types in different Parquet files, there may be issues reading it back directly.
Saying this, if others can chime in: perhaps I am missing something here and perhaps over-complicating things?
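To illustrate the "different underlying types" point, here is a sketch, assuming Spark 3.x defaults (spark.sql.parquet.writeLegacyFormat = false) and an illustrative path, of how the Parquet physical type chosen for a decimal column depends on its precision, so files written before and after a type change may not match:

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

val df = spark.range(1).select(
  col("id").cast(DecimalType(9, 2)).as("d_small"),   // precision <= 9  -> INT32
  col("id").cast(DecimalType(18, 2)).as("d_medium"), // precision <= 18 -> INT64
  col("id").cast(DecimalType(38, 2)).as("d_large")   // larger          -> FIXED_LEN_BYTE_ARRAY
)
df.write.mode("overwrite").parquet("/tmp/decimal-physical-types")
// Inspecting the file footers (for example with parquet-tools) should show three different
// Parquet physical types for the three decimal columns.
```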