skip to Main Content

I have two tables: table_a and table_b. Size-wise, table_b is much bigger than table_a.

There are fields(names) that are present in both the tables. The goal is to find the names (or the number of names when using count): that appear in table_a but do not appear in table_b for the maximum date value.

We compare them using the names as you can see in the code below. Using LIKE CONCAT as the name from table_a could be a substring.

SELECT count(DISTINCT "Name")
FROM table_a
WHERE table_a."Date Collected" = (
        SELECT MAX("Date Collected")
        FROM table_a
    )
    AND NOT EXISTS (

        SELECT *
        FROM table_b
        WHERE UPPER(table_b."computer_name") LIKE CONCAT('%', UPPER(table_a."Name"), '%')
            AND table_b."date_collected" = (
                SELECT MAX("date_collected")
                FROM table_b
            )
    )

This query takes a long time to run and the values aren’t matching up- meaning we are not getting an accurate count or list.

Are there anyways this could be made more efficient and accurate?

2

Answers


  1. Have you tried starting with table_a left join table_b? Did you check whether you really need to use UPPER? It feels like that might impede some automatic optimizations or use of indices/statistics. Do you have indices/statistics on the larger table?

    Login or Signup to reply.
  2. Assuming Postgres, this should do what you describe:

    WITH a AS (
       SELECT a."Name"
       FROM   table_a a
       ORDER  BY a."Date Collected" DESC NULLS LAST
       FETCH  FIRST 1 ROWS WITH TIES
       )
    , b AS (
       SELECT computer_name
       FROM   table_b b
       ORDER  BY date_collected DESC NULLS LAST
       FETCH  FIRST 1 ROWS WITH TIES
       )
    SELECT *
    FROM   a
    LEFT   JOIN b ON b.computer_name ~* a."Name"
    WHERE  b.computer_name IS NULL;
    

    CTE a gets all names from table A with the latest date in that table.
    CTE b gets all computer names from B with the latest date in that table.
    I use the fitting and typically efficient WITH TIES for the purpose. See:

    The outer query only keeps names from a that are not contained in any computer name in b.
    Using the more elegant case-insensitive regular expression operator ~* there.

    And the LEFT JOIN / IS NULL technique. See:

    The pivotal point for performance (and whether that’s in fact the best query) will be fitting indexes for the query considering cardinalities, data distribution, and selectivity. None of which are clear from the question.

    Also, consider proper, consistent, loser-case names for your DB. See:

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search