skip to Main Content

I have 2 tables. t1.uuid has 630,000 distinct values. t2.uuid has 300,000 distinct values.

When I run

SELECT 
    t1.uuid
    , t2.uuid
FROM 
    t1 --630,000 uuids
    LEFT OUTER JOIN t2 -- 300,000 uuids
        ON t1.uuid = t2.uuid 
WHERE
    t2.uuid IS NULL

There are no results.

Edit1:
For clarification:
t1 looks like:

uuid
ufo123
abc456
def789

t2 looks like

uuid
ufo123
def789

Every record in t2 has a match in t1. Not every record in t1 has a match in t2 (as indicated by the size of the tables.

I expect the join to result in:

t1.uuid t2.uuid
ufo123 ufo123
abc456 NULL
def789 def789

And the result of my query to by:

t1.uuid t2.uuid
abc456 NULL

But instead I get no results. I’m not sure if there is a communication issue between Redshift and DBEAVER to produce this behavior.

Edit2:
I ran the following which resulted in only 300,000 records (same as only t2):

SELECT 
    COUNT(DISTINCT t1.uuid)
FROM 
    t1 --630,000 uuids
    FULL JOIN t2 -- 300,000 uuids
        ON t1.uuid = t2.uuid 

This is completely wrong.

2

Answers


  1. The problem is that you cannot check NULL with equal sign in Redshift. "NULL = NULL" is false.

    You need to expand your JOIN ON clause (I left it general case which isn’t needed with you WHERE clause):

    SELECT 
        t1.uuid
        , t2.uuid
    FROM 
        t1 --630,000 uuids
        LEFT OUTER JOIN t2 -- 300,000 uuids
            ON (t1.uuid = t2.uuid) OR (t1.uuid IS NULL AND t2.uuid IS NULL)
    WHERE
        t2.uuid IS NULL
    

    The problem is that when both t1 and t2 have NULL uuid these will all join to each other and could explode your results. So you may want to check this in a different way if there are a lot of NULLs.

    As for an explanation let me address the "why" of this. This has to do with the Redshift inferring a WHERE clause from your SQL. It goes like this – if t2.uuid = X and t1.uuid = t2.uuid then Redshift should be able to apply the WHERE clause t1.uuid = X. But this produces the null set give the reasons above. Redshift does this to reduce the scanned data as much as possible.

    Update:

    Note that you are recreating the EXCEPT clause with this LEFT OUTER JOIN. You will likely get better results with EXCEPT.

    Login or Signup to reply.
  2. Your logic looks fine. Testing on a sample data. 3 records in table a and 1 record in table b. Output is 2 records that don’t match.

    WITH t1 AS (SELECT 1 AS col UNION SELECT 2 UNION SELECT 3)
       , t2 AS (SELECT 1 AS col)
    SELECT *
    FROM t1
         LEFT JOIN t2 ON t1.col = t2.col
    where t2.col is null
    

    Output

    +------+-------+
    |t1.col|t2.col |
    +------+-------+
    |2     |NULL   |
    |3     |NULL   |
    +------+-------+
    

    I’d recommend checking the data that it’s matching on. As Pelayo Martinez mentioned, it’s highly likely your LEFT table is T2 instead of T1. Only then it’ll be a no row result.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search