
[BUG] Column invariant not enforced on write


Bug

Describe the problem

I’m trying to understand column invariant enforcement in Delta Lake so I can implement it in delta-rs. However, I am unable to get PySpark to throw an error when writing values that violate the invariant. Am I misunderstanding the spec, or is this a bug?

Steps to reproduce

import pyspark
import pyspark.sql.types
import delta
from delta.tables import DeltaTable

def get_spark():
    builder = (
        pyspark.sql.SparkSession.builder.appName("MyApp")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config(
            "spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        )
    )
    return delta.configure_spark_with_delta_pip(builder).getOrCreate()

spark = get_spark()

schema = pyspark.sql.types.StructType([
    pyspark.sql.types.StructField(
        "c1",
        dataType=pyspark.sql.types.IntegerType(),
        nullable=False,
        metadata={"delta.invariants": "c1 > 3"},
    )
])

table = DeltaTable.create(spark) \
    .tableName("testTable") \
    .addColumns(schema) \
    .execute()

# This should fail, but doesn't
spark.createDataFrame([(2,)], schema=schema).write.saveAsTable(
    "testTable",
    mode="append",
    format="delta",
)

Observed results

The write succeeds, even though the delta.invariants key is clearly present in the schema, the table’s writer protocol version is 2, and the written value clearly violates the invariant.

First log file:

{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
{"metaData":{"id":"e8204eae-cd90-41c2-b685-92f22126b54a","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"c1\",\"type\":\"integer\",\"nullable\":false,\"metadata\":{\"delta.invariants\":\"c1 > 3\"}}]}","partitionColumns":[],"configuration":{},"createdTime":1656459957813}}
{"commitInfo":{"timestamp":1656459957820,"operation":"CREATE TABLE","operationParameters":{"isManaged":"true","description":null,"partitionBy":"[]","properties":"{}"},"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{},"engineInfo":"Apache-Spark/3.2.1 Delta-Lake/1.2.1","txnId":"6d370f8e-211f-4624-8a40-6fbd67e905c8"}}

Second log file:

{"add":{"path":"part-00000-0d61b29d-60ee-47d1-a121-2641fbc3ae1d-c000.snappy.parquet","partitionValues":{},"size":326,"modificationTime":1656459958951,"dataChange":true,"stats":"{\"numRecords\":0,\"minValues\":{},\"maxValues\":{},\"nullCount\":{}}"}}
{"add":{"path":"part-00003-b30e416e-c616-4d80-87b6-182baf8f0830-c000.snappy.parquet","partitionValues":{},"size":479,"modificationTime":1656459958981,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":{\"c1\":2},\"maxValues\":{\"c1\":2},\"nullCount\":{\"c1\":0}}"}}
{"commitInfo":{"timestamp":1656459958996,"operation":"WRITE","operationParameters":{"mode":"Append","partitionBy":"[]"},"readVersion":0,"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{"numFiles":"2","numOutputRows":"1","numOutputBytes":"805"},"engineInfo":"Apache-Spark/3.2.1 Delta-Lake/1.2.1","txnId":"00a036ec-243d-4543-b7d2-186f031ca2f1"}}

Expected results

I expected the write to throw an exception. This case should be essentially identical to this unit test, right? https://github.com/delta-io/delta/blob/5d3d73fe714f47bbe30e0414a8f9132000d8932c/core/src/test/scala/org/apache/spark/sql/delta/schema/InvariantEnforcementSuite.scala#L218-L232
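
To make the expectation concrete, this is roughly the check I had in mind (a sketch only; the exact exception type surfaced on the Python side is an assumption, so it just catches Exception and inspects the message):

try:
    spark.createDataFrame([(2,)], schema=schema).write.saveAsTable(
        "testTable",
        mode="append",
        format="delta",
    )
except Exception as e:  # exact Python-side exception type is an assumption
    # Expected path: the commit is rejected because c1 = 2 violates c1 > 3.
    assert "invariant" in str(e).lower()
else:
    raise AssertionError("write succeeded even though c1 = 2 violates c1 > 3")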

Further details

Environment information

  • Delta Lake version: 1.2.1
  • Spark version: 3.2.1
  • Scala version:

Willingness to contribute

The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?

I have no experience with Scala, so if this is a bug I may not be able to fix it. But I’d be happy to add further clarification to the Protocol spec to clear up the expectations around delta.invariants.

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 8 (3 by maintainers)

Top GitHub Comments

1 reaction
vkorukanti commented, Jul 19, 2022

@wjones127 Heard from @zsxwing that this feature has bugs and is being deprecated in favor of constraints.

0 reactions
vkorukanti commented, Jul 19, 2022

@wjones127 Makes sense. Reopening. Not sure what the procedure is to deprecate features. @tdas?
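
For reference, a minimal sketch of the CHECK-constraint route the maintainers point to above (the constraint name c1_min is an illustration, and the constraint has to be added while the table still satisfies it, since Delta validates existing rows when a constraint is added):

# Sketch: enforce the same condition with a table-level CHECK constraint
# instead of the column-metadata invariant.
spark.sql("ALTER TABLE testTable ADD CONSTRAINT c1_min CHECK (c1 > 3)")

# With the constraint in place, this append is expected to be rejected at write time.
spark.createDataFrame([(2,)], schema=schema).write.saveAsTable(
    "testTable",
    mode="append",
    format="delta",
)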
