skip to Main Content

Struggling with a large dataset in my mariaDB database. I have two tables, where table A contains 57 million rows and table B contains around 500. Table B is a subset of ids related to a column in table A. I want to delete all rows from A which do not have a corresponding ID in table B.

Example table A:

classification_id Name
20 Mercedes
30 Kawasaki
80 Leitz
70 HP

Example table B:

classification_id Type
20 car
30 bike
40 bus
50 boat

So in this example the last two rows from table A would be deleted (or a mirror table would be made containing only the first two rows, thats also fine).

I tried to do the second one using an inner join but this query took a few minutes before giving an out of memory exception.

Any suggestions on how to tackle this?

2

Answers


  1. try this:

    delete from "table A" where classification_id not in (select classification_id from "table B");
    
    Login or Signup to reply.
  2. Since you say that the filter table contains a relatively small number of rows, your best bet would be creating a separate table that contains the same columns as the original table A and the rows that match your criteria, then replace the original table and drop it. Also, with this number of IDs you probably want to use WHERE IN () instead of joins – as long as the field you’re using there is indexed, it will usually be way faster. Bringing it all together:

    CREATE TABLE new_A AS
      SELECT A.* FROM A 
      WHERE classification_id IN (SELECT classification_id FROM B);
    RENAME TABLE A TO old_A, new_A to A;
    DROP TABLE old_A;
    

    Things to be aware of:

    • Backup your data! And test the queries thoroughly before running that DROP TABLE. You don’t want to lose 57M rows of data because of a random answer at StackOverflow.
    • If A has any indexes or foreign keys, these won’t be copied over – so you’ll have to recreate them all manually. I’d recommend running SHOW CREATE TABLE A first and making note on its structure. Alternatively, you may consider creating the table new_A explicitly using the output of SHOW CREATE TABLE A as a template and then performing INSERT INTO new_A SELECT ... instead of CREATE TABLE new_A AS SELECT ... with the same query after this.
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search