My goal is to search for SKU values that are duplicated if we exclude special characters and get ids of those records.
INSERT INTO
sneakers (sneaker_id, sku)
VALUES
(1, '111-111'),
(2, '111/111'),
(3, '222-222'),
(4, '333:333'),
(5, '333-333');
In this example I want to get sneaker_id 1,2,4,5 because SKUs for those records are duplicated apart from the special character.
Here’s the approach that I came up with:
SELECT sneaker_id, sku
FROM sneakers
WHERE REGEXP_REPLACE(sku, '[]\[!@#$%.&*`~^_{}:;<>/\|()-]+', ' ') IN (SELECT REGEXP_REPLACE(sku, '[]\[!@#$%.&*`~^_{}:;<>/\|()-]+', ' ')
FROM sneakers
GROUP BY REGEXP_REPLACE(sku, '[]\[!@#$%.&*`~^_{}:;<>/\|()-]+', ' ')
HAVING COUNT(REGEXP_REPLACE(sku, '[]\[!@#$%.&*`~^_{}:;<>/\|()-]+', ' ')) >1);
It works when testing it on example table, however running this query on my main table (235k records) takes way too long (up to 10 mins).
How to achieve the same result optimizing performance?
3
Answers
see: DBFIDDLE
Just count how many time the sku is in the set after replacement.
I’d suggest you maintain a second column called "clean_sku" or equivalent.
You can then use PG triggers to keep it in sync with the source data.
For example
Then create a function to maintain it:
Once you have the function, you can create insert and update triggers to maintain it:
Essentially, this conversion of the initial data to a representation that will perform for querying only needs to be repeated whenever the source data is changed.
In database programming this pattern appears again and again, we are going to put the data in a format that supports the performance needs of our application, and we are going to use the "constraining" mechanisms in the DB engine to insure our data integrity. And then we will write some documentation explaining why we did it 🙂
From there its a simple as
This should be fairly efficient
It accesses the table only once, and also makes the conversion only once.
Db fiddle
If this still doesn’t meet your performance needs you might create a functional index of the normalized sku.
Or keep it as a normal column in the table