
My goal is to find SKU values that are duplicates once special characters are excluded, and to get the ids of those records.

INSERT INTO 
    sneakers (sneaker_id, sku)
VALUES
    (1, '111-111'),
    (2, '111/111'),
    (3, '222-222'),
    (4, '333:333'),
    (5, '333-333');

In this example I want to get sneaker_id 1, 2, 4, and 5, because the SKUs for those records are duplicates once the special characters are ignored.
Here’s the approach that I came up with:

SELECT sneaker_id, sku
FROM sneakers
WHERE REGEXP_REPLACE(sku, '[]\[!@#$%.&*`~^_{}:;<>/\|()-]+', ' ') IN (
    SELECT REGEXP_REPLACE(sku, '[]\[!@#$%.&*`~^_{}:;<>/\|()-]+', ' ')
    FROM sneakers
    GROUP BY REGEXP_REPLACE(sku, '[]\[!@#$%.&*`~^_{}:;<>/\|()-]+', ' ')
    HAVING COUNT(REGEXP_REPLACE(sku, '[]\[!@#$%.&*`~^_{}:;<>/\|()-]+', ' ')) > 1
);

It works when tested on the example table; however, running this query on my main table (235k records) takes far too long (up to 10 minutes).
How can I achieve the same result with better performance?
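For reference, the replacement the query relies on collapses every run of special characters into a single space, so the sample rows normalize like this:

```sql
-- What the normalization produces for the sample rows:
SELECT sneaker_id,
       REGEXP_REPLACE(sku, '[]\[!@#$%.&*`~^_{}:;<>/\|()-]+', ' ') AS clean_sku
FROM sneakers;
-- 1 → '111 111', 2 → '111 111', 3 → '222 222', 4 → '333 333', 5 → '333 333'
```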

3 Answers


  1. SELECT
      sneaker_id,
      sku,
      REGEXP_REPLACE(sku, '[]\[!@#$%.&*`~^_{}:;<>/\|()-]+', ' '),
      x.count
    FROM sneakers
    LEFT JOIN (SELECT REGEXP_REPLACE(sku, '[]\[!@#$%.&*`~^_{}:;<>/\|()-]+', ' ') skux,
                      COUNT(*) AS count
               FROM sneakers
               GROUP BY REGEXP_REPLACE(sku, '[]\[!@#$%.&*`~^_{}:;<>/\|()-]+', ' ')) x
        ON x.skux = REGEXP_REPLACE(sku, '[]\[!@#$%.&*`~^_{}:;<>/\|()-]+', ' ')
    WHERE x.count > 1
    

    see: DBFIDDLE

    Just count how many times each SKU appears in the set after replacement.

  2. I’d suggest you maintain a second column called "clean_sku" or equivalent.

    You can then use PG triggers to keep it in sync with the source data.

    For example

    alter table sneakers add clean_sku varchar(64); -- check size, no schema in original question
    

    Then create a function to maintain it:

    CREATE OR REPLACE FUNCTION sneaker_update()
      RETURNS trigger AS
    $BODY$
    BEGIN
        new.clean_sku := REGEXP_REPLACE(new.sku, '[]\[!@#$%.&*`~^_{}:;<>/\|()-]+', ' ');
        RETURN new;
    END;
    $BODY$
    LANGUAGE plpgsql;
    

    Once you have the function, you can create insert and update triggers to maintain it:

    CREATE OR REPLACE TRIGGER sneakerinsert
      BEFORE INSERT
      ON sneakers
      FOR EACH ROW
      EXECUTE PROCEDURE sneaker_update();
    
    CREATE OR REPLACE TRIGGER sneakerupdate
      BEFORE UPDATE OF sku
      ON sneakers
      FOR EACH ROW
      EXECUTE PROCEDURE sneaker_update();
    

    Essentially, the conversion of the source data into a representation that performs well for querying only needs to be repeated when the source data changes.

    This pattern appears again and again in database programming: we put the data into a format that supports the performance needs of our application, and we use the "constraining" mechanisms of the DB engine to ensure data integrity. And then we write some documentation explaining why we did it 🙂
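    On PostgreSQL 12 or later, a stored generated column is a lighter-weight alternative to the trigger pair: the engine maintains the value itself, and no function or triggers are needed. A sketch, assuming the same regex as above (the `text` type here is an assumption, since the original question gives no schema):

    ```sql
    -- PostgreSQL 12+: the engine keeps clean_sku in sync automatically.
    -- Works because REGEXP_REPLACE with constant arguments is immutable.
    ALTER TABLE sneakers
        ADD COLUMN clean_sku text
        GENERATED ALWAYS AS (REGEXP_REPLACE(sku, '[]\[!@#$%.&*`~^_{}:;<>/\|()-]+', ' ')) STORED;
    ```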

    From there it's as simple as

    SELECT  sneaker_id
    FROM    sneakers
    WHERE   clean_sku IN
    (
        SELECT  clean_sku
        FROM    sneakers
        GROUP   BY clean_sku
        HAVING  COUNT(*) > 1
    )
    
          
    
  3. This should be fairly efficient:

    select * 
    from (
      select 
        x.*,
        count(*) over (partition by skux) cnt
      from (
        SELECT
          sneaker_id,
          sku,
          REGEXP_REPLACE(sku,'[]\[!@#$%.&*`~^_{}:;<>/\|()-]+', ' ') skux
        from sneakers 
      ) x
    ) x
    where cnt > 1
    

    It accesses the table only once and performs the conversion only once.

    Db fiddle

    If this still doesn’t meet your performance needs, you might create a functional (expression) index on the normalized SKU.

    Or keep it as a normal column in the table.
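    Such an expression index could look like this (the index name is illustrative; note that the planner will only use the index when a query repeats the exact same expression):

    ```sql
    -- Expression index on the normalized SKU (PostgreSQL).
    CREATE INDEX sneakers_norm_sku_idx
        ON sneakers (REGEXP_REPLACE(sku, '[]\[!@#$%.&*`~^_{}:;<>/\|()-]+', ' '));
    ```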
