Incorrect result in certain situations with multiple joins with different partition schemes


This is a bug introduced by #12013. The result is wrong when all of the following hold:

  • The query uses COALESCE(joinKey) on top of a FULL OUTER JOIN with an equi-join condition.
  • The children of the FullJoin node use a different hash function to compute the partition from the join keys. For example, the hash is computed on (a, constant) while the join key is just a.
  • There is another JOIN with the result of the FULL OUTER JOIN, using an equi-join on only the coalesced keys of the FULL OUTER JOIN.

In this situation, the newly introduced optimization assumes that the result of the FULL OUTER JOIN is already partitioned on COALESCE(a), so no further shuffle is needed before the next join. However, because the hash is computed on (a, constant), a row that is “partitioned on a” can still sit on a different node than the one a hash computed on just a would assign it to. A shuffle is therefore still needed to produce the correct result.
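
The mismatch between the two partitioning schemes can be illustrated with a small simulation. This is only a sketch under stated assumptions: `partition`, `NUM_NODES`, and the SHA-256-based hash are hypothetical stand-ins chosen for illustration, not Presto's actual partitioning code, and the constant 'b' is an arbitrary example value.

```python
import hashlib

NUM_NODES = 4  # hypothetical cluster size


def partition(*keys):
    """Map a tuple of partitioning columns to a node index.

    Hypothetical stand-in for an engine's hash partitioning function.
    """
    digest = hashlib.sha256(repr(keys).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_NODES


keys = range(1000)

# Scheme 1: hash computed on the join key alone.
on_a = {a: partition(a) for a in keys}

# Scheme 2: hash computed on (a, constant).
on_a_const = {a: partition(a, "b") for a in keys}

# Within either scheme, equal values of a always map to the same node,
# so each side is "partitioned on a". Across schemes, the same value of a
# may map to different nodes, so co-location cannot be assumed.
mismatched = [a for a in keys if on_a[a] != on_a_const[a]]
print(f"{len(mismatched)} of {len(keys)} keys land on different nodes")
```

Any key in `mismatched` would be joined against the wrong node's data if the shuffle before the second join were skipped.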

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction
rongrong commented, Apr 12, 2019

Here’s an example query that can produce the wrong plan, though it’s harder to come up with a query that produces meaningful and wrong results:

SELECT * 
FROM customer t3
JOIN (
    SELECT coalesce(c1, c2) c
    FROM (
        SELECT custkey c1, name FROM customer WHERE name = 'a') t1 
    FULL OUTER JOIN (
        SELECT custkey c2, name FROM customer WHERE name = 'b' GROUP BY 1, 2) t2
    ON t1.c1 = t2.c2) t
ON t3.custkey = t.c;

0 reactions
rongrong commented, Sep 26, 2019

@tooptoop4 Yes, #12946 reintroduced the FULL OUTER JOIN + COALESCE optimization, and it should not have this bug. It will be released in 0.227.
