[Feature Request] add WHEN NOT MATCHED BY SOURCE/TARGET clause support

Feature request

https://delta-users.slack.com/archives/CJ70UCSHM/p1661955032288519

Overview

WHEN NOT MATCHED BY SOURCE/TARGET clause support

Motivation

Feature parity with other popular SQL databases, and ease of use.

Further details

Each day I get a full dump of a table. However, this data needs to be cleaned and, in particular, compressed using an SCD2-style approach to be easily consumable downstream. Unfortunately, I do not get changesets or a NULL value for a key in case of deletions; a deleted row (including its key) simply no longer appears in the dump. The MERGE syntax documented for other SQL databases:

MERGE <target_table> [AS TARGET]
USING <table_source> [AS SOURCE]
ON <search_condition>
[WHEN MATCHED 
   THEN <merge_matched> ]
[WHEN NOT MATCHED [BY TARGET]
   THEN <merge_not_matched> ]
[WHEN NOT MATCHED BY SOURCE
   THEN <merge_matched> ];

will not work with Delta/Spark as the WHEN NOT MATCHED clause does not seem to support the BY SOURCE | TARGET extension. How can I still calculate the SCD2 representation?

  1. If the table is empty, simply load all the data (the initial load).
  2. When a new day's full copy of the data arrives:
  • INSERT any new keys into the table.
  • For EXISTING keys, perform an update: mark the OLD row as no longer valid (set its end-date) and add a new SCD2 row with the contents of the new row, valid until infinity (end-date null).
  • If a previously present key is no longer present, close its SCD2 valid_from/valid_to interval by setting the end-date.
    • If a new record arrives for this key in the future, start a fresh SCD2 row, valid until infinity, with the new values.
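
The steps above can be modelled in plain Python to make the intended semantics concrete. This is only a sketch of the logic, not Delta or Spark code: the function name apply_snapshot, the dict-based row layout, and the valid_from/valid_to field names are assumptions made for illustration. It also assumes that a key whose values are unchanged keeps its open row (the "compressed" reading of SCD2), which the steps above do not state explicitly.

```python
def apply_snapshot(history, snapshot, day):
    """Fold one full daily dump into an SCD2 history (a list of dicts).

    Hypothetical layout: every history row has key, value, value2,
    valid_from and valid_to, where valid_to=None means "valid until infinity".
    """
    incoming = {row["key"]: row for row in snapshot}
    # Only rows that are still open can be closed or superseded.
    open_rows = {r["key"]: r for r in history if r["valid_to"] is None}

    def open_new(row):
        history.append({"key": row["key"], "value": row["value"],
                        "value2": row["value2"],
                        "valid_from": day, "valid_to": None})

    for key, current in open_rows.items():
        new = incoming.get(key)
        if new is None:
            # Key vanished from the dump: the NOT MATCHED BY SOURCE case.
            current["valid_to"] = day
        elif (new["value"], new["value2"]) != (current["value"], current["value2"]):
            # Existing key with changed values: close the old row, open a new one.
            current["valid_to"] = day
            open_new(new)

    for key, new in incoming.items():
        if key not in open_rows:
            # Brand-new key, or a key that returns after having been closed.
            open_new(new)
    return history


history = []
apply_snapshot(history, [{"key": 1, "value": 4, "value2": "a"},
                         {"key": 3, "value": 6, "value2": "c"}], day=1)
apply_snapshot(history, [{"key": 1, "value": 4, "value2": "a"}], day=2)
apply_snapshot(history, [{"key": 1, "value": 4, "value2": "a"},
                         {"key": 3, "value": 66, "value2": "c"}], day=3)
```

After the three calls, key 3 has one closed row (valid from day 1 to day 2) and one fresh open row starting on day 3, while key 1 keeps its single open row. The closing branch is the part that a MERGE without a NOT MATCHED BY SOURCE clause cannot express directly.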

An example case/dataset:

import pandas as pd

# Assumes a running Spark session (`spark`) with Delta Lake support is available.

# Day 1: initial full load.
d1 = spark.createDataFrame(pd.DataFrame({'key': [1, 2, 3], 'value': [4, 5, 6], 'value2': ["a", "b", "c"], 'date': [1, 1, 1]}))
# d1.show()

# Day 2: key 3 is MISSING (it should be deleted, or rather SCD2-invalidated).
d2 = spark.createDataFrame(pd.DataFrame({'key': [1, 2], 'value': [4, 5], 'value2': ["a", "b"], 'date': [2, 2]}))

# Day 3: key 3 was missing in d2; it is back now and should start a new SCD2 row.
d3 = spark.createDataFrame(pd.DataFrame({'key': [1, 2, 3], 'value': [4, 5, 66], 'value2': ["a", "b", "c"], 'date': [3, 3, 3]}))

# Day 4: a new record (key 4) is added.
d4 = spark.createDataFrame(pd.DataFrame({'key': [1, 2, 3, 4], 'value': [4, 5, 66, 44], 'value2': ["a", "b", "c", "d"], 'date': [4, 4, 4, 4]}))

# Day 5: a new record (key 5) is added, one is removed (key 1) and one is updated (key 3).
d5 = spark.createDataFrame(pd.DataFrame({'key': [2, 3, 4, 5], 'value': [5, 67, 44, 55], 'value2': ["b", "c", "d", "e"], 'date': [5, 5, 5, 5]}))
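
To see which MERGE branch each key of this dataset would fall into on each day, the key sets of d1 to d5 can be compared with plain set operations. This is a sketch outside Spark: the classify helper and the snapshots dictionary are hypothetical names, the key lists are copied from the DataFrames above, and "target" is taken to mean the keys that are currently open in the SCD2 table.

```python
# Keys present in each daily dump, copied from d1..d5 above.
snapshots = {
    1: {1, 2, 3},
    2: {1, 2},
    3: {1, 2, 3},
    4: {1, 2, 3, 4},
    5: {2, 3, 4, 5},
}


def classify(target_keys, source_keys):
    """Split keys into the three branches of a MERGE statement."""
    return {
        "matched": sorted(target_keys & source_keys),
        "not_matched_by_target": sorted(source_keys - target_keys),  # INSERT
        "not_matched_by_source": sorted(target_keys - source_keys),  # close interval
    }


active = set()   # keys with an open SCD2 row; empty before the initial load
branches = {}
for day in sorted(snapshots):
    branches[day] = classify(active, snapshots[day])
    active = set(snapshots[day])

for day, result in branches.items():
    print(day, result)
```

On day 2, key 3 lands in not_matched_by_source, and on day 5 key 1 does. These are the rows whose intervals need closing and which the existing WHEN MATCHED / WHEN NOT MATCHED clauses never see.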

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

  • Yes. I can contribute this feature independently.
  • Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
  • No. I cannot contribute this feature at this time.

Issue Analytics

  • State: open
  • Created: a year ago
  • Reactions: 2
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

johanl-db commented, Dec 6, 2022 (7 reactions)

I created a design doc to implement support for WHEN NOT MATCHED BY SOURCE clauses: [Design Doc] WHEN NOT MATCHED BY SOURCE. This enables selectively updating or deleting target rows that have no matches in the source table based on the merge condition.

API

A new whenNotMatchedBySource(condition) method is added to the Delta Table API, similar to the existing whenMatched(condition) and whenNotMatched(condition) methods. It returns a builder that allows specifying the action to apply using update(set) or delete(). whenNotMatchedBySource(condition) accepts an optional condition that needs to be satisfied for the corresponding action to be applied.

Usage example:

targetDeltaTable.as("t")
  .merge(sourceTable.as("s"), condition = "t.key = s.key")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .whenNotMatchedBySource(condition = "t.value > 0").update(set = "t.value = 0")
  .whenNotMatchedBySource().delete()

This merge invocation will:

  • Update all target rows that have a match in the source table using the source value.
  • Insert all source rows that have no match in the target into the target table.
  • Update all target rows that have no match in the source if t.value is strictly positive, otherwise delete the target row.
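
The three outcomes listed above can be checked against a small plain-Python model. This is a sketch of the described semantics only, not the Delta API: the merge function is a hypothetical stand-in, tables are reduced to key-to-value dicts, and the two whenNotMatchedBySource clauses are assumed to be tried in the order they were written, with the first satisfied one winning.

```python
def merge(target, source):
    """Model of the merge above on dicts mapping key -> value."""
    result = {}
    for key, value in target.items():
        if key in source:
            # whenMatched().updateAll(): take the source value.
            result[key] = source[key]
        elif value > 0:
            # whenNotMatchedBySource(t.value > 0).update(t.value = 0)
            result[key] = 0
        # Otherwise whenNotMatchedBySource().delete(): the row is dropped.
    for key, value in source.items():
        if key not in target:
            # whenNotMatched().insertAll(): source rows with no target match.
            result[key] = value
    return result


target = {1: 10, 2: 5, 3: -1}
source = {1: 11, 4: 7}
merged = merge(target, source)
print(merged)
```

Here key 1 is updated from the source, key 2 is reset to 0 because it has no source match and a positive value, key 3 is deleted, and key 4 is inserted.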

More details on the API and the implementation proposal can be found in the design doc. The SQL API will be shipped with Spark 3.4, see https://github.com/apache/spark/pull/38400.

Project Plan

Task | Description | Status | PR
Delta API Scala Support | Implement support for the clause in Delta using the Scala DeltaTable API. | Review | #1511
Delta API Python Support | Implement support for the clause in Delta using the Python API. | Not Started | -
SQL Support | After Spark 3.4 release / upgrading to Spark 3.4, make necessary changes to support the clause in SQL. | Not Started | -

seddonm1 commented, Sep 30, 2022 (0 reactions)

FYI, I actually did all the work a couple of years ago and have a branch with this implemented, for the Scala API only, here: https://github.com/tripl-ai/delta

At the time, the PR to this repo was rejected, but if you are motivated the code could be updated for the latest Delta (not by me).
