skip to Main Content

I am updating a table with the primary key of another table, based on some join condition.

The problem is that when multiple rows of the 1st table match with 1 row of the 2nd table, all the rows of the 1st table get the same primary key of the 2nd table.

In such cases, actually I want to update only the 1st row (for each match), so that the same primary key of the 2nd table is never assigned to more than 1 row of the 1st table.

DROP TEMPORARY TABLE IF EXISTS docs1;
DROP TEMPORARY TABLE IF EXISTS docs2;
CREATE TEMPORARY TABLE docs1 (
    id int PRIMARY KEY,
    other_table_id int,
    doc_no VARCHAR(40)
);

CREATE TEMPORARY TABLE docs2 (
    id int PRIMARY KEY,
    doc_no VARCHAR(40)
);

INSERT INTO docs1 (id, doc_no) VALUES (150, '001/2324'), (157, '01-2324'), (165, 'I/101'), (123, 'I-101');

INSERT INTO docs2 (id, doc_no) VALUES (11, '1/2324'), (37, 'I/101');

CREATE FUNCTION `RemoveSpecialCharacters` (doc_no VARCHAR(50))
RETURNS VARCHAR(50)
DETERMINISTIC
BEGIN
    RETURN TRIM(LEADING '0' FROM REPLACE(REPLACE(doc_no, '/', ''), '-', ''));
END

SET SQL_SAFE_UPDATES = 0;

UPDATE docs1 T0 
    INNER JOIN docs2 T1 
    ON RemoveSpecialCharacters(T0.doc_no) = RemoveSpecialCharacters(T1.doc_no)
    -- and many other join conditions and WHERE clauses
    SET T0.other_table_id = T1.id;

Required output in docs1:

id, other_table_id, doc_no
150, 11, 001/2324
157, NULL, 01-2324
165, NULL, I/101
123, 37, I-101

Instead of updating row id 165 with the other table id of 37, I want to treat the row id 123 as the first matched row in this case, based on the ascending order of the row id.

Is there a simple way to achieve this without writing complicated queries using ROW_NUMBER / PARTITION, because as mentioned in the comments, the actual JOIN condition involves 5-6 columns and further there are another set of conditions in the WHERE clause as well. Furthermore, I have to run a few dozen such UPDATE statements with minor variations on the JOIN conditions and the column to be updated and the value to be updated with.

The actual tables involved have around 200 thousand records in each table.

If it cannot be done in a single simplistic UPDATE query, I am okay to first select the lines and then loop over them inside a stored procedure, if it works fast enough. Ideal expected time is less than 20 seconds for 200 thousand records in each table.

Just for context, this is not a one-time activity. This process has to be run on demand whenever the user triggers it from the application.

EDITED after seeing Akina’s answer:

Can the solution / logic also be expanded to do multi row updates (and not just the 1st row update), when the 2nd table’s multiple rows match with the 1st table’s multiple rows on the same joining values? For example, if 4 docs in table 1 match with 3 docs in table 2 on the same joining values, the 1st 3 rows in table 1 should be updated with the id of the 1st 3 rows of table 2 and the 4th row should not be updated.

If yes, this will be even more brilliant ! Determining which one is the first row, second row can be based on the id ascending order.

Note that any given id of the 2nd table should never go to multiple rows of the first table. Sorry, this is an extension of the original question, but if experts can answer this too, it will be helpful.

Actually, in my project, the multi row update was the original requirement, but I had given up on this.

3

Answers


  1. Chosen as BEST ANSWER

    I realized there is 1 more solution possible without using window functions. Here is what I tried:

    I modified the docs1 table definition to set a UNIQUE constraint on other_table_id and then I used the UPDATE IGNORE statement.

    CREATE TEMPORARY TABLE docs1 (
        id int PRIMARY KEY,
        other_table_id int UNIQUE,
        doc_no VARCHAR(40)
    );
    
    UPDATE IGNORE docs1 T0 
        INNER JOIN docs2 T1 
        ON RemoveSpecialCharacters(T0.doc_no) = RemoveSpecialCharacters(T1.doc_no)
        SET T0.other_table_id = T1.id;
    

    This seems to work fine for the requirement of 1st row update for each joining group but it won't work for the extended requirement of multi row update.


  2. If you want to update only one row within some group while sorting by some key without window functions usage then use 2 copies of the table with according joining condition.

    Sample:

    CREATE TABLE test (
      group_key INT,
      ordering_key INT, 
      value INT
      );
    INSERT INTO test (group_key, ordering_key)
    VALUES
    (1,1), (1,2),
    (2,1), (2,2), (2,3);
    SELECT * FROM test ORDER BY 1,2;
    
    group_key ordering_key value
    1 1 null
    1 2 null
    2 1 null
    2 2 null
    2 3 null

    The task: update value column with 1 for each group with the same group_key, but only one row within a group with the least ordering_key value.

    UPDATE test AS t1
    LEFT JOIN test AS t2
      ON t1.group_key = t2.group_key        -- join the rows which belongs the same group
      AND t1.ordering_key > t2.ordering_key -- and additional condition, see below
    SET t1.value = 1
    WHERE t2.group_key is null              -- and there is no a row with lower ordering_key
    
    SELECT * FROM test ORDER BY 1,2;
    
    group_key ordering_key value
    1 1 1
    1 2 null
    2 1 1
    2 2 null
    2 3 null

    fiddle

    Login or Signup to reply.
  3. Here is the window function route using a subquery to perform the join and generate a row_number() which can then act as a filter in your UPDATE statement:

    UPDATE docs1 docs
        INNER JOIN (
          SELECT T0.id, T1.id as docs2id, T1.doc_no, ROW_NUMBER() OVER (PARTITION BY T1.doc_no ORDER BY T0.id) as rn
          FROM docs1 as T0 
            INNER JOIN docs2 as T1 
              ON RemoveSpecialCharacters(T0.doc_no) = RemoveSpecialCharacters(T1.doc_no)
        ) as doc_join
        ON docs.id = doc_join.id
    SET docs.other_table_id = doc_join.id
    WHERE rn = 1;
    

    dbfiddle

    That subquery should be easy to work with for your more complex join. It’s only down in a subquery to accomodate the extra processing step needed for the window function/row_number determination.

    What’s nice about this route is that you are determining the update order as opposed to using the spooky logic with the unique constraint that leaves it up to "whichever record you happen to get to first".

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search