
overwriteSchema changes nullability of all fields to true

See original GitHub issue

Hey, when I try to add a new column through overwriteSchema, all fields become nullable.

Before:

root
 |-- id: integer (nullable = false)
 |-- name: string (nullable = false)
 |-- country: string (nullable = false)

After:

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- country: string (nullable = true)
 |-- new_id: integer (nullable = true)

Example source code:

import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._
import org.apache.spark.sql.{SaveMode, SparkSession}

object OverrideSchemaTest {

  def main(args: Array[String]): Unit = {
    val spark = createSparkSession

    DeltaTable
      .create(spark)
      .location(deltaTablePath)
      .addColumns(
        StructType(Seq(
          StructField("id", IntegerType, nullable=false, new MetadataBuilder()
            .putString("description", "test")
            .build()
          ),
          StructField("name", StringType, nullable=false),
          StructField("country", StringType, nullable=false),
        ))
      )
      .partitionedBy("country")
      .execute()

    import spark.implicits._

    val df = Seq(
        (1, "foo", "US"),
        (2, "bar", "US"),
        (3, "baz", "US"),
        (4, "foo_uk", "UK")
      )
      .toDF("id", "name", "country")


    DeltaTable
      .forPath(deltaTablePath)
      .as("entry")
      .merge(
        df.as("new_entry"),
        "entry.id = new_entry.id"
      )
      .whenMatched()
      .updateAll()
      .whenNotMatched()
      .insertAll()
      .execute()

    // Before
    DeltaTable
      .forPath(spark, deltaTablePath)
      .toDF
      .schema
      .printTreeString()

    val newDf = DeltaTable
      .forPath(spark, deltaTablePath)
      .toDF

    newDf
      .withColumn(
        "new_id",
        col("id")
          .as(
            "new_id",
            new MetadataBuilder()
            .putString("description", "new id")
            .build()
          )
      )
      .write
      .format("delta")
      .option("overwriteSchema", "true")
      .mode(SaveMode.Overwrite)
      .partitionBy("country")
      .save(deltaTablePath)

    // After
    DeltaTable
      .forPath(spark, deltaTablePath)
      .toDF
      .schema
      .printTreeString()
  }

  def deltaTablePath: String = System.getProperty("user.dir") + "/var/names"

  def createSparkSession: SparkSession = {
    SparkSession
      .builder
      .master("local[*]")
      .appName("spark app")
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()
  }
}

Is this expected behavior?

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 12 (7 by maintainers)

Top GitHub Comments

1 reaction
norbertmwk commented, Jan 17, 2022

@Kimahriman I checked what you suggested and it seems to work (which makes the inconsistency even bigger).

So here is a “short” explanation of how to add a column to a table without flipping the nullability of all the existing columns to true:

  1. Create the table using DeltaTable.create(spark)
  2. Insert data into the table
  3. Take the schema from the current table, add the new column to it as (nullable = true), and use DeltaTable.replace(spark)
  4. Time travel to the version before the latest “REPLACE TABLE” operation (the one before the current version), read it, add the new column to the DataFrame, and save it again

Example source code:

import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types._
import org.apache.spark.sql.{SaveMode, SparkSession}

import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object OverrideSchemaTest {

  lazy val tableName: String = "names_" + LocalDateTime.now().format(DateTimeFormatter.ofPattern("yyyyMMddHHmmss"))

  def main(args: Array[String]): Unit = {
    val spark = createSparkSession

    DeltaTable
      .create(spark)
      .location(deltaTablePath)
      .addColumns(
        StructType(Seq(
          StructField("id", IntegerType, nullable=false, new MetadataBuilder()
            .putString("description", "test")
            .build()
          ),
          StructField("name", StringType, nullable=false),
          StructField("country", StringType, nullable=false),
        ))
      )
      .partitionedBy("country")
      .execute()

    println("Initial schema of the empty table:")

    DeltaTable.forPath(deltaTablePath)
      .history()
      .select("version", "timestamp", "operation", "operationParameters", "readVersion")
      .show(false)

    import spark.implicits._

    val df = Seq(
      (1, "foo", "US"),
      (2, "bar", "US"),
      (3, "baz", "US"),
      (4, "foo_uk", "UK")
    )
      .toDF("id", "name", "country")

    DeltaTable
      .forPath(deltaTablePath)
      .as("entry")
      .merge(
        df.as("new_entry"),
        "entry.id = new_entry.id"
      )
      .whenMatched()
      .updateAll()
      .whenNotMatched()
      .insertAll()
      .execute()

    println("First Insert:")
    DeltaTable.forPath(deltaTablePath)
      .toDF
      .show(false)

    println("Table History before replace:")
    DeltaTable.forPath(deltaTablePath)
      .history()
      .select("version", "timestamp", "operation", "operationParameters", "readVersion")
      .show(false)

    DeltaTable
      .replace(spark)
      .location(deltaTablePath)
      .addColumns(
        StructType(Seq(
          StructField("id", IntegerType, nullable=false, new MetadataBuilder()
            .putString("description", "test")
            .build()
          ),
          StructField("name", StringType, nullable=false),
          StructField("country", StringType, nullable=false),
          StructField("new_id", IntegerType, nullable=true),
        ))
      )
      .partitionedBy("country")
      .execute()


    println("Table History after Table.Replace operation:")
    DeltaTable.forPath(deltaTablePath)
      .history()
      .select("version", "timestamp", "operation", "operationParameters", "readVersion")
      .show(false)

    println("Schema after Table.Replace operation:")

    DeltaTable.forPath(deltaTablePath)
      .toDF
      .printSchema()

    println("Data in table after Table.Replace operation:")
    DeltaTable.forPath(deltaTablePath)
      .toDF
      .show(false)

    println("Time travel to data before Table.Replace and merge it into new empty table...")

    spark
      .read
      .format("delta")
      .option("versionAsOf", 1)
      .load(deltaTablePath)
        .withColumn("new_id", lit(null))
      .write
      .format("delta")
      .mode(SaveMode.Overwrite)
      .save(deltaTablePath)

    println("Latest history:")

    DeltaTable.forPath(deltaTablePath)
      .history()
      .select("version", "timestamp", "operation", "operationParameters", "readVersion")
      .show(false)

    println("Latest schema:")

    DeltaTable.forPath(deltaTablePath)
      .toDF
      .printSchema()

    println("Latest data:")

    DeltaTable.forPath(deltaTablePath)
      .toDF
      .show(false)
  }

  def deltaTablePath: String = System.getProperty("user.dir") + "/var/" + tableName

  def createSparkSession: SparkSession = {
    SparkSession
      .builder
      .master("local[*]")
      .appName("spark app")
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()
  }
}

Output:

Initial history of the empty table:
+-------+-------------------+------------+---------------------------------------------------------------------------------------+-----------+
|version|timestamp          |operation   |operationParameters                                                                    |readVersion|
+-------+-------------------+------------+---------------------------------------------------------------------------------------+-----------+
|0      |2022-01-17 19:11:08|CREATE TABLE|{isManaged -> false, description -> null, partitionBy -> ["country"], properties -> {}}|null       |
+-------+-------------------+------------+---------------------------------------------------------------------------------------+-----------+

First Insert:
+---+------+-------+
|id |name  |country|
+---+------+-------+
|4  |foo_uk|UK     |
|1  |foo   |US     |
|2  |bar   |US     |
|3  |baz   |US     |
+---+------+-------+

Table History before replace:
+-------+-------------------+------------+-------------------------------------------------------------------------------------------------------------------------------------------+-----------+
|version|timestamp          |operation   |operationParameters                                                                                                                        |readVersion|
+-------+-------------------+------------+-------------------------------------------------------------------------------------------------------------------------------------------+-----------+
|1      |2022-01-17 19:11:19|MERGE       |{predicate -> (entry.id = new_entry.id), matchedPredicates -> [{"actionType":"update"}], notMatchedPredicates -> [{"actionType":"insert"}]}|0          |
|0      |2022-01-17 19:11:08|CREATE TABLE|{isManaged -> false, description -> null, partitionBy -> ["country"], properties -> {}}                                                    |null       |
+-------+-------------------+------------+-------------------------------------------------------------------------------------------------------------------------------------------+-----------+

Table History after Table.Replace operation:
+-------+-------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------+-----------+
|version|timestamp          |operation    |operationParameters                                                                                                                        |readVersion|
+-------+-------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------+-----------+
|2      |2022-01-17 19:11:21|REPLACE TABLE|{isManaged -> false, description -> null, partitionBy -> ["country"], properties -> {}}                                                    |1          |
|1      |2022-01-17 19:11:19|MERGE        |{predicate -> (entry.id = new_entry.id), matchedPredicates -> [{"actionType":"update"}], notMatchedPredicates -> [{"actionType":"insert"}]}|0          |
|0      |2022-01-17 19:11:08|CREATE TABLE |{isManaged -> false, description -> null, partitionBy -> ["country"], properties -> {}}                                                    |null       |
+-------+-------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------+-----------+

Schema after Table.Replace operation:
root
 |-- id: integer (nullable = false)
 |-- name: string (nullable = false)
 |-- country: string (nullable = false)
 |-- new_id: integer (nullable = true)

Data in table after Table.Replace operation:
+---+----+-------+------+
|id |name|country|new_id|
+---+----+-------+------+
+---+----+-------+------+

Time travel to the data before Table.Replace and overwrite the new, empty table with it...
Latest history:
+-------+-------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------+-----------+
|version|timestamp          |operation    |operationParameters                                                                                                                        |readVersion|
+-------+-------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------+-----------+
|3      |2022-01-17 19:11:24|WRITE        |{mode -> Overwrite, partitionBy -> []}                                                                                                     |2          |
|2      |2022-01-17 19:11:21|REPLACE TABLE|{isManaged -> false, description -> null, partitionBy -> ["country"], properties -> {}}                                                    |1          |
|1      |2022-01-17 19:11:19|MERGE        |{predicate -> (entry.id = new_entry.id), matchedPredicates -> [{"actionType":"update"}], notMatchedPredicates -> [{"actionType":"insert"}]}|0          |
|0      |2022-01-17 19:11:08|CREATE TABLE |{isManaged -> false, description -> null, partitionBy -> ["country"], properties -> {}}                                                    |null       |
+-------+-------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------+-----------+

Latest schema:
root
 |-- id: integer (nullable = false)
 |-- name: string (nullable = false)
 |-- country: string (nullable = false)
 |-- new_id: integer (nullable = true)

Latest data:
+---+------+-------+------+
|id |name  |country|new_id|
+---+------+-------+------+
|4  |foo_uk|UK     |null  |
|1  |foo   |US     |null  |
|2  |bar   |US     |null  |
|3  |baz   |US     |null  |
+---+------+-------+------+

So, technically speaking, it worked.
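
For comparison, the same column can also be added with no data rewrite at all: Delta Lake supports ALTER TABLE ... ADD COLUMNS as a metadata-only schema change, and it leaves the nullability of the existing columns untouched. A minimal sketch, reusing spark and deltaTablePath from the example above:

// Adding the column through SQL touches only the table metadata;
// no files are rewritten, and the existing fields keep nullable = false.
// (A column added to an existing table is necessarily nullable.)
spark.sql(s"ALTER TABLE delta.`$deltaTablePath` ADD COLUMNS (new_id INT)")

DeltaTable.forPath(spark, deltaTablePath).toDF.printSchema()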

0 reactions
zsxwing commented, May 24, 2022

Thanks @Kimahriman for providing the suggestion. I’m closing this as the problem seems resolved. Feel free to re-open this if I’m wrong.

Read more comments on GitHub.

Top Results From Across the Web

Spark withColumn changes column nullable property in schema
I'm using withColumn in order to override a certain column (applying the same value to the entire data frame), my problem is that...
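A common workaround for that withColumn behavior is to rebuild the DataFrame against an explicit schema, since Spark takes the declared nullability flags at face value rather than re-checking the data. A minimal sketch (df and strictSchema are illustrative names, not taken from the issue above):

import org.apache.spark.sql.types._
import spark.implicits._

val df = Seq((1, "foo"), (2, "bar")).toDF("id", "name")

// Rebuild against the schema we actually want; Spark does not
// re-validate the rows against the declared nullability.
val strictSchema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = false)
))
val restored = spark.createDataFrame(df.rdd, strictSchema)
restored.printSchema()
// root
//  |-- id: integer (nullable = false)
//  |-- name: string (nullable = false)
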
Using nullability in GraphQL
By default, every field in GraphQL is nullable, and you can opt in to mark it non-null. In this article, we'll go over...
Merging different schemas in Apache Spark - Medium
To simulate schema changes, I created some fictitious data using the library ... As all partitions have these columns, the read function can...
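The schema merging described there is driven by a single read option; a quick sketch (the /tmp/events path is illustrative):

// mergeSchema is off by default because it is relatively expensive;
// when enabled, Spark unions the schemas found across all part files.
val merged = spark.read
  .option("mergeSchema", "true")
  .parquet("/tmp/events")
merged.printSchema()
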
Parquet Files - Spark 3.3.1 Documentation
When reading Parquet files, all columns are automatically converted to be nullable for compatibility reasons. Loading Data Programmatically. Using the data from ...
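That Parquet behavior is easy to confirm: write out a DataFrame with a non-nullable schema and read it back. A small sketch, assuming a throwaway /tmp path:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val strict = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(1), Row(2))),
  StructType(Seq(StructField("id", IntegerType, nullable = false)))
)
strict.printSchema() // id: integer (nullable = false)

strict.write.mode("overwrite").parquet("/tmp/nullability-demo")

// Spark relaxes every column to nullable when reading Parquet back,
// independent of anything Delta's overwriteSchema does.
spark.read.parquet("/tmp/nullability-demo").printSchema()
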
addNotNullConstraint | Liquibase Docs
The addNotNullConstraint Change Type enforces a column to always contain a value and not to accept NULL values so that you cannot insert...
