I am updating a table with the primary key of another table, based on some join condition.
The problem is that when multiple rows of the 1st table match with 1 row of the 2nd table, all the rows of the 1st table get the same primary key of the 2nd table.
In such cases, actually I want to update only the 1st row (for each match), so that the same primary key of the 2nd table is never assigned to more than 1 row of the 1st table.
DROP TEMPORARY TABLE IF EXISTS docs1;
DROP TEMPORARY TABLE IF EXISTS docs2;
CREATE TEMPORARY TABLE docs1 (
id int PRIMARY KEY,
other_table_id int,
doc_no VARCHAR(40)
);
CREATE TEMPORARY TABLE docs2 (
id int PRIMARY KEY,
doc_no VARCHAR(40)
);
INSERT INTO docs1 (id, doc_no) VALUES (150, '001/2324'), (157, '01-2324'), (165, 'I/101'), (123, 'I-101');
INSERT INTO docs2 (id, doc_no) VALUES (11, '1/2324'), (37, 'I/101');
CREATE FUNCTION `RemoveSpecialCharacters` (doc_no VARCHAR(50))
RETURNS VARCHAR(50)
DETERMINISTIC
BEGIN
RETURN TRIM(LEADING '0' FROM REPLACE(REPLACE(doc_no, '/', ''), '-', ''));
END
SET SQL_SAFE_UPDATES = 0;
UPDATE docs1 T0
INNER JOIN docs2 T1
ON RemoveSpecialCharacters(T0.doc_no) = RemoveSpecialCharacters(T1.doc_no)
-- and many other join conditions and WHERE clauses
SET T0.other_table_id = T1.id;
Required output in docs1:
id, other_table_id, doc_no
150, 11, 001/2324
157, NULL, 01-2324
165, NULL, I/101
123, 37, I-101
Instead of updating row id 165 with the other table id of 37, I want to treat the row id 123 as the first matched row in this case, based on the ascending order of the row id.
Is there a simple way to achieve this without writing complicated queries using ROW_NUMBER / PARTITION, because as mentioned in the comments, the actual JOIN condition involves 5-6 columns and further there are another set of conditions in the WHERE clause as well. Furthermore, I have to run a few dozen such UPDATE statements with minor variations on the JOIN conditions and the column to be updated and the value to be updated with.
The actual tables involved have around 200 thousand records in each table.
If it cannot be done in a single simplistic UPDATE query, I am okay to first select the lines and then loop over them inside a stored procedure, if it works fast enough. Ideal expected time is less than 20 seconds for 200 thousand records in each table.
Just for context, this is not a one-time activity. This process has to be run on demand whenever the user triggers it from the application.
EDITED after seeing Akina’s answer:
Can the solution / logic also be expanded to do multi row updates (and not just the 1st row update), when the 2nd table’s multiple rows match with the 1st table’s multiple rows on the same joining values? For example, if 4 docs in table 1 match with 3 docs in table 2 on the same joining values, the 1st 3 rows in table 1 should be updated with the id of the 1st 3 rows of table 2 and the 4th row should not be updated.
If yes, this will be even more brilliant ! Determining which one is the first row, second row can be based on the id ascending order.
Note that any given id of the 2nd table should never go to multiple rows of the first table. Sorry, this is an extension of the original question, but if experts can answer this too, it will be helpful.
Actually, in my project, the multi row update was the original requirement, but I had given up on this.
3
Answers
I realized there is 1 more solution possible without using window functions. Here is what I tried:
I modified the docs1 table definition to set a UNIQUE constraint on other_table_id and then I used the UPDATE IGNORE statement.
This seems to work fine for the requirement of 1st row update for each joining group but it won't work for the extended requirement of multi row update.
If you want to update only one row within some group while sorting by some key without window functions usage then use 2 copies of the table with according joining condition.
Sample:
fiddle
Here is the window function route using a subquery to perform the join and generate a
row_number()
which can then act as a filter in your UPDATE statement:dbfiddle
That subquery should be easy to work with for your more complex join. It’s only down in a subquery to accomodate the extra processing step needed for the window function/row_number determination.
What’s nice about this route is that you are determining the update order as opposed to using the spooky logic with the unique constraint that leaves it up to "whichever record you happen to get to first".