
overwriteSchema changes nullability of all fields to true

See original GitHub issue

Hey, when I try to add a new column through overwriteSchema, all fields become nullable.

Before:

root
 |-- id: integer (nullable = false)
 |-- name: string (nullable = false)
 |-- country: string (nullable = false)

After:

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- country: string (nullable = true)
 |-- new_id: integer (nullable = true)

Example source code:

import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._
import org.apache.spark.sql.{SaveMode, SparkSession}

object OverrideSchemaTest {

  def main(args: Array[String]): Unit = {
    val spark = createSparkSession

    DeltaTable
      .create(spark)
      .location(deltaTablePath)
      .addColumns(
        StructType(Seq(
          StructField("id", IntegerType, nullable=false, new MetadataBuilder()
            .putString("description", "test")
            .build()
          ),
          StructField("name", StringType, nullable=false),
          StructField("country", StringType, nullable=false),
        ))
      )
      .partitionedBy("country")
      .execute()

    import spark.implicits._

    val df = Seq(
        (1, "foo", "US"),
        (2, "bar", "US"),
        (3, "baz", "US"),
        (4, "foo_uk", "UK")
      )
      .toDF("id", "name", "country")


    DeltaTable
      .forPath(deltaTablePath)
      .as("entry")
      .merge(
        df.as("new_entry"),
        "entry.id = new_entry.id"
      )
      .whenMatched()
      .updateAll()
      .whenNotMatched()
      .insertAll()
      .execute()

    // Before
    DeltaTable
      .forPath(spark, deltaTablePath)
      .toDF
      .schema
      .printTreeString()

    val newDf = DeltaTable
      .forPath(spark, deltaTablePath)
      .toDF

    newDf
      .withColumn(
        "new_id",
        col("id")
          .as(
            "new_id",
            new MetadataBuilder()
            .putString("description", "new id")
            .build()
          )
      )
      .write
      .format("delta")
      .option("overwriteSchema", "true")
      .mode(SaveMode.Overwrite)
      .partitionBy("country")
      .save(deltaTablePath)

    // After
    DeltaTable
      .forPath(spark, deltaTablePath)
      .toDF
      .schema
      .printTreeString()
  }

  def deltaTablePath: String = System.getProperty("user.dir") + "/var/names"

  def createSparkSession: SparkSession = {
    SparkSession
      .builder
      .master("local[*]")
      .appName("spark app")
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()
  }
}

Is this expected behavior?

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 12 (7 by maintainers)

Top GitHub Comments

1 reaction
norbertmwk commented, Jan 17, 2022

@Kimahriman I checked what you suggested and it seems to work (which makes the inconsistency even bigger).

So here is a “short” explanation of how to add a column to a table without flipping the nullability of all the existing columns to true:

  1. Create the table using DeltaTable.create(spark)
  2. Insert data into the table
  3. Take the schema from the current table, add the new column to it as (nullable = true), and use DeltaTable.replace(spark)
  4. Time travel to the version before the latest “REPLACE TABLE” operation (the one before the current version), read it, add the new column to the DataFrame, and save it again

Example source code:

import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types._
import org.apache.spark.sql.{SaveMode, SparkSession}

import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object OverrideSchemaTest {

  lazy val tableName: String = "names_" + LocalDateTime.now().format(DateTimeFormatter.ofPattern("yyyyMMddHHmmss"))

  def main(args: Array[String]): Unit = {
    val spark = createSparkSession

    DeltaTable
      .create(spark)
      .location(deltaTablePath)
      .addColumns(
        StructType(Seq(
          StructField("id", IntegerType, nullable=false, new MetadataBuilder()
            .putString("description", "test")
            .build()
          ),
          StructField("name", StringType, nullable=false),
          StructField("country", StringType, nullable=false),
        ))
      )
      .partitionedBy("country")
      .execute()

    println("Initial schema of the empty table:")

    DeltaTable.forPath(deltaTablePath)
      .history()
      .select("version", "timestamp", "operation", "operationParameters", "readVersion")
      .show(false)

    import spark.implicits._

    val df = Seq(
      (1, "foo", "US"),
      (2, "bar", "US"),
      (3, "baz", "US"),
      (4, "foo_uk", "UK")
    )
      .toDF("id", "name", "country")

    DeltaTable
      .forPath(deltaTablePath)
      .as("entry")
      .merge(
        df.as("new_entry"),
        "entry.id = new_entry.id"
      )
      .whenMatched()
      .updateAll()
      .whenNotMatched()
      .insertAll()
      .execute()

    println("First Insert:")
    DeltaTable.forPath(deltaTablePath)
      .toDF
      .show(false)

    println("Table History before replace:")
    DeltaTable.forPath(deltaTablePath)
      .history()
      .select("version", "timestamp", "operation", "operationParameters", "readVersion")
      .show(false)

    DeltaTable
      .replace(spark)
      .location(deltaTablePath)
      .addColumns(
        StructType(Seq(
          StructField("id", IntegerType, nullable=false, new MetadataBuilder()
            .putString("description", "test")
            .build()
          ),
          StructField("name", StringType, nullable=false),
          StructField("country", StringType, nullable=false),
          StructField("new_id", IntegerType, nullable=true),
        ))
      )
      .partitionedBy("country")
      .execute()


    println("Table History after Table.Replace operation:")
    DeltaTable.forPath(deltaTablePath)
      .history()
      .select("version", "timestamp", "operation", "operationParameters", "readVersion")
      .show(false)

    println("Schema after Table.Replace operation:")

    DeltaTable.forPath(deltaTablePath)
      .toDF
      .printSchema()

    println("Data in table after Table.Replace operation:")
    DeltaTable.forPath(deltaTablePath)
      .toDF
      .show(false)

    println("Time travel to data before Table.Replace and merge it into new empty table...")

    spark
      .read
      .format("delta")
      .option("versionAsOf", 1)
      .load(deltaTablePath)
        .withColumn("new_id", lit(null))
      .write
      .format("delta")
      .mode(SaveMode.Overwrite)
      .save(deltaTablePath)

    println("Latest history:")

    DeltaTable.forPath(deltaTablePath)
      .history()
      .select("version", "timestamp", "operation", "operationParameters", "readVersion")
      .show(false)

    println("Latest schema:")

    DeltaTable.forPath(deltaTablePath)
      .toDF
      .printSchema()

    println("Latest data:")

    DeltaTable.forPath(deltaTablePath)
      .toDF
      .show(false)
  }

  def deltaTablePath: String = System.getProperty("user.dir") + "/var/" + tableName

  def createSparkSession: SparkSession = {
    SparkSession
      .builder
      .master("local[*]")
      .appName("spark app")
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()
  }
}

Output:

Initial history of the empty table:
+-------+-------------------+------------+---------------------------------------------------------------------------------------+-----------+
|version|timestamp          |operation   |operationParameters                                                                    |readVersion|
+-------+-------------------+------------+---------------------------------------------------------------------------------------+-----------+
|0      |2022-01-17 19:11:08|CREATE TABLE|{isManaged -> false, description -> null, partitionBy -> ["country"], properties -> {}}|null       |
+-------+-------------------+------------+---------------------------------------------------------------------------------------+-----------+

First Insert:
+---+------+-------+
|id |name  |country|
+---+------+-------+
|4  |foo_uk|UK     |
|1  |foo   |US     |
|2  |bar   |US     |
|3  |baz   |US     |
+---+------+-------+

Table History before replace:
+-------+-------------------+------------+-------------------------------------------------------------------------------------------------------------------------------------------+-----------+
|version|timestamp          |operation   |operationParameters                                                                                                                        |readVersion|
+-------+-------------------+------------+-------------------------------------------------------------------------------------------------------------------------------------------+-----------+
|1      |2022-01-17 19:11:19|MERGE       |{predicate -> (entry.id = new_entry.id), matchedPredicates -> [{"actionType":"update"}], notMatchedPredicates -> [{"actionType":"insert"}]}|0          |
|0      |2022-01-17 19:11:08|CREATE TABLE|{isManaged -> false, description -> null, partitionBy -> ["country"], properties -> {}}                                                    |null       |
+-------+-------------------+------------+-------------------------------------------------------------------------------------------------------------------------------------------+-----------+

Table History after Table.Replace operation:
+-------+-------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------+-----------+
|version|timestamp          |operation    |operationParameters                                                                                                                        |readVersion|
+-------+-------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------+-----------+
|2      |2022-01-17 19:11:21|REPLACE TABLE|{isManaged -> false, description -> null, partitionBy -> ["country"], properties -> {}}                                                    |1          |
|1      |2022-01-17 19:11:19|MERGE        |{predicate -> (entry.id = new_entry.id), matchedPredicates -> [{"actionType":"update"}], notMatchedPredicates -> [{"actionType":"insert"}]}|0          |
|0      |2022-01-17 19:11:08|CREATE TABLE |{isManaged -> false, description -> null, partitionBy -> ["country"], properties -> {}}                                                    |null       |
+-------+-------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------+-----------+

Schema after Table.Replace operation:
root
 |-- id: integer (nullable = false)
 |-- name: string (nullable = false)
 |-- country: string (nullable = false)
 |-- new_id: integer (nullable = true)

Data in table after Table.Replace operation:
+---+----+-------+------+
|id |name|country|new_id|
+---+----+-------+------+
+---+----+-------+------+

Time travel to the data before Table.Replace and overwrite the new, empty table with it...
Latest history:
+-------+-------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------+-----------+
|version|timestamp          |operation    |operationParameters                                                                                                                        |readVersion|
+-------+-------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------+-----------+
|3      |2022-01-17 19:11:24|WRITE        |{mode -> Overwrite, partitionBy -> []}                                                                                                     |2          |
|2      |2022-01-17 19:11:21|REPLACE TABLE|{isManaged -> false, description -> null, partitionBy -> ["country"], properties -> {}}                                                    |1          |
|1      |2022-01-17 19:11:19|MERGE        |{predicate -> (entry.id = new_entry.id), matchedPredicates -> [{"actionType":"update"}], notMatchedPredicates -> [{"actionType":"insert"}]}|0          |
|0      |2022-01-17 19:11:08|CREATE TABLE |{isManaged -> false, description -> null, partitionBy -> ["country"], properties -> {}}                                                    |null       |
+-------+-------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------+-----------+

Latest schema:
root
 |-- id: integer (nullable = false)
 |-- name: string (nullable = false)
 |-- country: string (nullable = false)
 |-- new_id: integer (nullable = true)

Latest data:
+---+------+-------+------+
|id |name  |country|new_id|
+---+------+-------+------+
|4  |foo_uk|UK     |null  |
|1  |foo   |US     |null  |
|2  |bar   |US     |null  |
|3  |baz   |US     |null  |
+---+------+-------+------+

So, technically speaking, it worked.
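
For comparison, the same column can also be added with no data rewrite at all: Delta Lake supports ALTER TABLE ... ADD COLUMNS as a metadata-only schema change, and it leaves the nullability of the existing columns untouched. A minimal sketch, reusing spark and deltaTablePath from the example above:

// Adding the column through SQL touches only the table metadata;
// no files are rewritten, and the existing fields keep nullable = false.
// (A column added to an existing table is necessarily nullable.)
spark.sql(s"ALTER TABLE delta.`$deltaTablePath` ADD COLUMNS (new_id INT)")

DeltaTable.forPath(spark, deltaTablePath).toDF.printSchema()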

0 reactions
zsxwing commented, May 24, 2022

Thanks @Kimahriman for providing the suggestion. I’m closing this as the problem seems resolved. Feel free to re-open this if I’m wrong.

Read more comments on GitHub.

Top Results From Across the Web

Spark withColumn changes column nullable property in schema
I'm using withColumn in order to override a certain column (applying the same value to the entire data frame), my problem is that...
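A common workaround for that withColumn behavior is to rebuild the DataFrame against an explicit schema, since Spark takes the declared nullability flags at face value rather than re-checking the data. A minimal sketch (df and strictSchema are illustrative names, not taken from the issue above):

import org.apache.spark.sql.types._
import spark.implicits._

val df = Seq((1, "foo"), (2, "bar")).toDF("id", "name")

// Rebuild against the schema we actually want; Spark does not
// re-validate the rows against the declared nullability.
val strictSchema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = false)
))
val restored = spark.createDataFrame(df.rdd, strictSchema)
restored.printSchema()
// root
//  |-- id: integer (nullable = false)
//  |-- name: string (nullable = false)
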
Using nullability in GraphQL
By default, every field in GraphQL is nullable, and you can opt in to mark it non-null. In this article, we'll go over...
Merging different schemas in Apache Spark - Medium
To simulate schema changes, I created some fictitious data using the library ... As all partitions have these columns, the read function can...
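The schema merging described there is driven by a single read option; a quick sketch (the /tmp/events path is illustrative):

// mergeSchema is off by default because it is relatively expensive;
// when enabled, Spark unions the schemas found across all part files.
val merged = spark.read
  .option("mergeSchema", "true")
  .parquet("/tmp/events")
merged.printSchema()
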
Parquet Files - Spark 3.3.1 Documentation
When reading Parquet files, all columns are automatically converted to be nullable for compatibility reasons. Loading Data Programmatically. Using the data from ...
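That Parquet behavior is easy to confirm: write out a DataFrame with a non-nullable schema and read it back. A small sketch, assuming a throwaway /tmp path:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val strict = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(1), Row(2))),
  StructType(Seq(StructField("id", IntegerType, nullable = false)))
)
strict.printSchema() // id: integer (nullable = false)

strict.write.mode("overwrite").parquet("/tmp/nullability-demo")

// Spark relaxes every column to nullable when reading Parquet back,
// independent of anything Delta's overwriteSchema does.
spark.read.parquet("/tmp/nullability-demo").printSchema()
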
addNotNullConstraint | Liquibase Docs
The addNotNullConstraint Change Type enforces a column to always contain a value and not to accept NULL values so that you cannot insert...
