
In this dbfiddle demo I have a DELETE FROM... WHERE at the end like so:

    ......
    DELETE FROM data_table 
      WHERE
        (location, param_id, ref_time, fcst_time) NOT IN (SELECT location, param_id, ref_time, fcst_time FROM sel1)
      AND
        (location, param_id, ref_time, fcst_time) NOT IN (SELECT location, param_id, ref_time, fcst_time FROM sel2);

Although it works, it seems unnecessarily wordy, and possibly also suboptimal in terms of performance.

Is there any way in which this can be simplified, e.g. by using a single NOT IN statement?

Because of the way that they are defined, there is no overlap/intersection between sel1 and sel2.

status_table has 8033 rows

data_table has 116432724 rows

sel1 has 61860084 rows

sel2 has 53706188 rows

Numbers don’t necessarily add up because it’s a live database and data is going in all the time. And yes, at the moment there are few if any rows to delete because most/all of the data is current.

2 Answers


  1. How about another WITH around everything?

    WITH
      stats AS (
        SELECT ref_time
          , max(updated) < (round(extract(epoch from now()) / 60) - 200) AS settled
          , (count(*) FILTER (WHERE processed) = count(*)) AND (max(updated) < (round(extract(epoch from now()) / 60) - 200)) AS ready
        FROM status_table
        GROUP BY ref_time
      ),
      min_ts AS (
        SELECT ref_time FROM stats WHERE ready ORDER BY ref_time DESC LIMIT 1
      ),
      sel1 AS (
        -- records that would be selected by an actual data lookup (use same logic)... we need to keep these (don't delete)
        SELECT DISTINCT ON (d.location, d.timestamp, d.param_id)
          d.location, d.param_id, d.ref_time, d.fcst_time
        FROM data_table AS d
        INNER JOIN stats AS s USING (ref_time)
        WHERE s.ready AND d.timestamp >= (SELECT ref_time FROM min_ts)
        ORDER BY d.location, d.timestamp, d.param_id, d.ref_time DESC
      ),
      sel2 AS (
        -- also keep all records that are in-progress (not 'settled')
        SELECT
          d.location, d.param_id, d.ref_time, d.fcst_time
        FROM data_table AS d
        INNER JOIN stats AS s USING (ref_time)
        WHERE NOT s.settled
      ),
      d AS (
        -- anti-join: rows of data_table matched by neither keep-set
        SELECT data_table.*
        FROM data_table
        LEFT JOIN sel1 ON sel1.location = data_table.location AND sel1.param_id = data_table.param_id
                      AND sel1.ref_time = data_table.ref_time AND sel1.fcst_time = data_table.fcst_time
        LEFT JOIN sel2 ON sel2.location = data_table.location AND sel2.param_id = data_table.param_id
                      AND sel2.ref_time = data_table.ref_time AND sel2.fcst_time = data_table.fcst_time
        WHERE sel1.location IS NULL
          AND sel2.location IS NULL
      )
    DELETE FROM data_table
    WHERE (location, param_id, ref_time, fcst_time) IN (SELECT location, param_id, ref_time, fcst_time FROM d);
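
The anti-join shape above can be sanity-checked on a toy dataset. Below is a minimal sketch using Python's sqlite3 as a stand-in for Postgres, with a single-column key instead of the four-column one; the table names mirror the question but the data is made up. It confirms that the double LEFT JOIN ... IS NULL filter selects the same rows as the original pair of NOT IN predicates.

```python
import sqlite3

# Toy stand-ins for data_table, sel1 and sel2, keyed by a single column.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE data_table (k INTEGER);
CREATE TABLE sel1 (k INTEGER);
CREATE TABLE sel2 (k INTEGER);
INSERT INTO data_table VALUES (1), (2), (3), (4);
INSERT INTO sel1 VALUES (1);   -- kept by sel1
INSERT INTO sel2 VALUES (2);   -- kept by sel2
""")

# Anti-join form: rows of data_table matched by neither keep-set.
doomed_antijoin = cur.execute("""
    SELECT d.k FROM data_table AS d
    LEFT JOIN sel1 ON sel1.k = d.k
    LEFT JOIN sel2 ON sel2.k = d.k
    WHERE sel1.k IS NULL AND sel2.k IS NULL
""").fetchall()

# Original form: two NOT IN predicates.
doomed_not_in = cur.execute("""
    SELECT k FROM data_table
    WHERE k NOT IN (SELECT k FROM sel1)
      AND k NOT IN (SELECT k FROM sel2)
""").fetchall()

assert sorted(doomed_antijoin) == sorted(doomed_not_in)  # rows 3 and 4
```

One caveat worth remembering: the two forms only coincide when the key columns are NOT NULL, because NOT IN filters out every candidate row as soon as the subquery yields a NULL.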
    
  2. Answering the original question in this thread:

    ......
    DELETE FROM data_table 
      WHERE
        (location, param_id, ref_time, fcst_time) 
            NOT IN 
        (SELECT location, param_id, ref_time, fcst_time FROM sel1
            UNION ALL
         SELECT location, param_id, ref_time, fcst_time FROM sel2);
    

    Changing WHERE (a) NOT IN (b) AND NOT IN (c) to
    WHERE (a) NOT IN (b UNION c) not only shortens the expression, but also provides some performance improvements. Depending on how big an overlap there is between b and c, you can pick either UNION or UNION ALL:

    • WHERE (a) NOT IN (b UNION c) removes duplicates between b and c, which adds a small upfront cost but can speed things up later, when a is compared against the effectively smaller union.
    • WHERE (a) NOT IN (b UNION ALL c) skips the deduplication, so if by design there can be no duplicates, it's the better choice.
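
Since NOT IN is a pure membership test, duplicates in the subquery cannot change which rows survive, only how much work the query does. Here is a minimal sketch (Python's sqlite3 standing in for Postgres, with hypothetical single-column tables t, b and c) showing that UNION and UNION ALL give identical results even when b and c overlap:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE t (k INTEGER);
CREATE TABLE b (k INTEGER);
CREATE TABLE c (k INTEGER);
INSERT INTO t VALUES (1), (2), (3), (4), (5);
INSERT INTO b VALUES (1), (2);
INSERT INTO c VALUES (2), (3);  -- overlaps b on key 2
""")

# UNION deduplicates the keep-set; UNION ALL keeps the duplicate 2.
keep_union = "SELECT k FROM b UNION SELECT k FROM c"
keep_union_all = "SELECT k FROM b UNION ALL SELECT k FROM c"

survivors_union = cur.execute(
    f"SELECT k FROM t WHERE k NOT IN ({keep_union})").fetchall()
survivors_union_all = cur.execute(
    f"SELECT k FROM t WHERE k NOT IN ({keep_union_all})").fetchall()

# Membership is unaffected by duplicates: both variants leave rows 4 and 5.
assert sorted(survivors_union) == sorted(survivors_union_all)
```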

    Here’s a demo showing how the plans change. Locally, I also tested a delete from a 120M-row table against two 10M-row tables with 99% overlap and got 118 s for the initial version, 105 s for UNION, and 98 s for UNION ALL. With the same row counts and no overlap, it was 118 s, 103 s, and 95 s. With more columns and more complex types, I’d expect the difference to be more pronounced.

    I figured I’d make my comment into an answer, adding the tests that followed, leaving the broader optimisation to be discussed in the new thread.
