overwriteSchema changes nullability of all fields to true
Hey, when I try to add a new column through overwriteSchema, all fields become nullable.
Before:
root
|-- id: integer (nullable = false)
|-- name: string (nullable = false)
|-- country: string (nullable = false)
After:
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- country: string (nullable = true)
|-- new_id: integer (nullable = true)
Example source code:
import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._
import org.apache.spark.sql.{SaveMode, SparkSession}

object OverrideSchemaTest {

  def main(args: Array[String]): Unit = {
    val spark = createSparkSession

    // Create a partitioned Delta table with three non-nullable columns.
    DeltaTable
      .create(spark)
      .location(deltaTablePath)
      .addColumns(
        StructType(Seq(
          StructField("id", IntegerType, nullable = false, new MetadataBuilder()
            .putString("description", "test")
            .build()
          ),
          StructField("name", StringType, nullable = false),
          StructField("country", StringType, nullable = false)
        ))
      )
      .partitionedBy("country")
      .execute()

    import spark.implicits._

    val df = Seq(
      (1, "foo", "US"),
      (2, "bar", "US"),
      (3, "baz", "US"),
      (4, "foo_uk", "UK")
    ).toDF("id", "name", "country")

    // Upsert the sample rows into the table.
    DeltaTable
      .forPath(spark, deltaTablePath)
      .as("entry")
      .merge(
        df.as("new_entry"),
        "entry.id = new_entry.id"
      )
      .whenMatched()
      .updateAll()
      .whenNotMatched()
      .insertAll()
      .execute()

    // Before: all three fields are nullable = false.
    DeltaTable
      .forPath(spark, deltaTablePath)
      .toDF
      .schema
      .printTreeString()

    // Add a new_id column and overwrite the table with overwriteSchema.
    val newDf = DeltaTable
      .forPath(spark, deltaTablePath)
      .toDF

    newDf
      .withColumn(
        "new_id",
        col("id")
          .as(
            "new_id",
            new MetadataBuilder()
              .putString("description", "new id")
              .build()
          )
      )
      .write
      .format("delta")
      .option("overwriteSchema", "true")
      .mode(SaveMode.Overwrite)
      .partitionBy("country")
      .save(deltaTablePath)

    // After: every field, including the pre-existing ones, is nullable = true.
    DeltaTable
      .forPath(spark, deltaTablePath)
      .toDF
      .schema
      .printTreeString()
  }

  def deltaTablePath: String = System.getProperty("user.dir") + "/var/names"

  def createSparkSession: SparkSession = {
    SparkSession
      .builder
      .master("local[*]")
      .appName("spark app")
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()
  }
}
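printTreeString only shows the tree; to double-check the flip field by field, the nullability and metadata can be dumped directly. A minimal sketch, assuming the same spark session and deltaTablePath as in the repro above:

// Inspect each field of the table written above; `spark` and
// `deltaTablePath` are the values from the repro.
spark.read.format("delta").load(deltaTablePath).schema.fields.foreach { f =>
  println(s"${f.name}: nullable = ${f.nullable}, metadata = ${f.metadata.json}")
}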
Is this expected behavior?
@Kimahriman I checked what you suggested and it seems to work (which makes the inconsistency even bigger).
So here is a "short" explanation of how to add a column to a table without flipping the nullability of every existing column to true: keep DeltaTable.create(spark) for the initial table, but instead of an overwriteSchema write (which ends with every field nullable = true), redefine the table with DeltaTable.replace(spark); a sketch follows below.
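A minimal sketch of that replace-based approach, assuming it mirrors the repro above; the restated schema, including a non-nullable new_id, is illustrative rather than the commenter's exact code:

import io.delta.tables.DeltaTable
import org.apache.spark.sql.types._

// Restate the full schema, keeping nullable = false on the existing
// columns and adding new_id. `spark` and `deltaTablePath` are the
// values from the repro above.
DeltaTable
  .replace(spark)
  .location(deltaTablePath)
  .addColumns(
    StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("name", StringType, nullable = false),
      StructField("country", StringType, nullable = false),
      StructField("new_id", IntegerType, nullable = false)
    ))
  )
  .partitionedBy("country")
  .execute()

Once the table definition carries the new column, the rows can be written back without the overwriteSchema option, so the declared nullability stays intact.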
So, technically speaking, it worked.
Thanks @Kimahriman for providing the suggestion. I’m closing this as the problem seems resolved. Feel free to re-open this if I’m wrong.