skip to Main Content

I have table towns which is main table. This table contains so many rows and it became so ‘dirty’ (someone inserted 5 milions rows) that I would like to get rid of unused towns.

There are 3 referent table that are using my town_id as reference to towns.

And I know there are many towns that are not used in this tables, and only if town_id is not found in neither of these 3 tables I am considering it as inactive and I would like to remove that town (because it’s not used).

as you can see towns is used in this 2 different tables:

  • employees
  • offices

and for table * vendors there is vendor_id in table towns since one vendor can have multiple towns.

so if vendor_id in towns is null and town_id is not found in any of these 2 tables it is safe to remove it 🙂

I created a query which might work but it is taking tooooo much time to execute, and it looks something like this:

select count(*) 
from towns
where vendor_id is null 
    and id not in (select town_id from banks) 
    and id not in (select town_id from employees)

So basically I said, if vendor_is is null it means this town is definately not related to vendors and in the same time if same town is not in banks and employees, than it will be safe to remove it.. but query took too long, and never executed successfully...since towns has 5 milions rows and that is reason why it is so dirty..

In face I’m not able to execute given query since server terminated abnormally..

Here is full error message:

ERROR: server closed the connection unexpectedly This probably means
the server terminated abnormally before or while processing the
request.

Any kind of help would be awesome
Thanks!

3

Answers


  1. You can try an JOIN on big tables it would be faster then two IN

    you could also try UNION ALL and live with the duplicates, as it is faster as UNION

    Finally you can use a combined Index on id and vendor_id, to speed up the query

    CREATE TABLe towns (id int , vendor_id int)
    
    CREATE TABLE
    
    CREATE tABLE banks (town_id int)
    
    CREATE TABLE
    
    CREATE tABLE employees (town_id int)
    
    CREATE TABLE
    
    select count(*) 
    from towns t1 JOIN (select town_id from banks UNION select town_id from employees) t2 on t1.id <> t2.town_id
    where vendor_id is null
    
    count
    0
    SELECT 1
    

    fiddle

    Login or Signup to reply.
  2. You can join the tables using LEFT JOIN so that to identify the town_id for which there is no row in tables banks and employee in the WHERE clause :

    WITH list AS
    ( SELECT t.town_id
        FROM towns AS t
        LEFT JOIN tbl.banks AS b ON b.town_id = t.town_id
        LEFT JOIN tbl.employees AS e ON e.town_id = t.town_id
       WHERE t.vendor_id IS NULL
         AND b.town_id IS NULL
         AND e.town_id IS NULL
       LIMIT 1000
    )
    DELETE FROM tbl.towns AS t
      USING list AS l
     WHERE t.town_id = l.town_id ;
    

    Before launching the DELETE, you can check the indexes on your tables.
    Adding an index as follow can be usefull :

    CREATE INDEX town_id_nulls ON towns (town_id NULLS FIRST) ;
    

    Last but not least you can add a LIMIT clause in the cte so that to limit the number of rows you detele when you execute the DELETE and avoid the unexpected termination. As a consequence, you will have to relaunch the DELETE several times until there is no more row to delete.

    Login or Signup to reply.
  3. The trick is to first make a list of all the town_id’s you want to keep and then start removing those that are not there.
    By looking in 2 tables you’re making life harder for the server so let’s just create 1 single list first.

    -- build empty temp-table
    CREATE TEMPORARY TABLE TEMP_must_keep 
    AS
    SELECT town_id       
      FROM tbl.towns
     WHERE 1 = 2;
     
    -- get id's from first table
    INSERT TEMP_must_keep (town_id)
    SELECT DISTINCT town_id 
      FROM tbl.banks;
      
    -- add index to speed up the EXCEPT below
    CREATE UNIQUE INDEX idx_uq_must_keep_town_id ON TEMP_must_keep (town_id);
    
    -- add new ones from second table
    INSERT TEMP_must_keep (town_id)
    SELECT town_id 
      FROM tbl.employees
     
    EXCEPT -- auto-distincts
    
    SELECT town_id 
      FROM TEMP_must_keep;
      
    -- rebuild index simply to ensure little fragmentation
    REINDEX TABLE TEMP_must_keep;
    
    -- optional, but might help: create a temporary index on the towns table to speed up the delete
    CREATE INDEX idx_towns_town_id_where_vendor_null ON tbl.towns (town_id) WHERE vendor IS NULL;
    
    -- Now do actual delete
    -- You can do a `SELECT COUNT(*)` rather than a `DELETE` first if you feel like it, both will probably take some time depending on your hardware.
    DELETE 
      FROM tbl.towns as del      
     WHERE vendor_id is null 
       AND NOT EXISTS ( SELECT * 
                          FROM TEMP_must_keep mk
                         WHERE mk.town_id = del.town_id);
                         
                         
    -- cleanup
    DROP INDEX tbl.idx_towns_town_id_where_vendor_null;
    DROP TABLE TEMP_must_keep;
    
                         
    

    The idx_towns_town_id_where_vendor_null is optional and I’m not sure if it will actaully lower the total time but IMHO it will help out with the DELETE operation if only because the index should give the Query Optimizer a better view on what volumes to expect.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search