
My goal is to find SKU values that are duplicates once special characters are excluded, and to get the ids of those records.

INSERT INTO 
    sneakers (sneaker_id, sku)
VALUES
    (1, '111-111'),
    (2, '111/111'),
    (3, '222-222'),
    (4, '333:333'),
    (5, '333-333');

In this example I want to get sneaker_id 1, 2, 4, and 5, because the SKUs for those records are duplicates once the special characters are ignored.
Here’s the approach that I came up with:

SELECT sneaker_id, sku
FROM sneakers
WHERE REGEXP_REPLACE(sku, '[]\[!@#$%.&*`~^_{}:;<>/\|()-]+', ' ') IN (
    SELECT REGEXP_REPLACE(sku, '[]\[!@#$%.&*`~^_{}:;<>/\|()-]+', ' ')
    FROM sneakers
    GROUP BY REGEXP_REPLACE(sku, '[]\[!@#$%.&*`~^_{}:;<>/\|()-]+', ' ')
    HAVING COUNT(REGEXP_REPLACE(sku, '[]\[!@#$%.&*`~^_{}:;<>/\|()-]+', ' ')) > 1
);

It works when tested on the example table; however, running this query on my main table (235k records) takes far too long (up to 10 minutes).
How can I achieve the same result with better performance?
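For reference, the replacement the query relies on collapses every run of special characters into a single space, so the sample rows normalize like this:

```sql
-- What the normalization produces for the sample rows:
SELECT sneaker_id,
       REGEXP_REPLACE(sku, '[]\[!@#$%.&*`~^_{}:;<>/\|()-]+', ' ') AS clean_sku
FROM sneakers;
-- 1 → '111 111', 2 → '111 111', 3 → '222 222', 4 → '333 333', 5 → '333 333'
```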

3 Answers


  1. SELECT
      sneaker_id,
      sku,
      REGEXP_REPLACE(sku, '[]\[!@#$%.&*`~^_{}:;<>/\|()-]+', ' '),
      x.count
    FROM sneakers
    LEFT JOIN (SELECT REGEXP_REPLACE(sku, '[]\[!@#$%.&*`~^_{}:;<>/\|()-]+', ' ') skux,
                      COUNT(*) AS count
               FROM sneakers
               GROUP BY REGEXP_REPLACE(sku, '[]\[!@#$%.&*`~^_{}:;<>/\|()-]+', ' ')) x
        ON x.skux = REGEXP_REPLACE(sku, '[]\[!@#$%.&*`~^_{}:;<>/\|()-]+', ' ')
    WHERE x.count > 1
    

    see: DBFIDDLE

    Just count how many times each SKU appears in the set after replacement.

  2. I’d suggest you maintain a second column called "clean_sku" or equivalent.

    You can then use PG triggers to keep it in sync with the source data.

    For example

    alter table sneakers add clean_sku varchar(64); -- check size, no schema in original question
    

    Then create a function to maintain it:

    CREATE OR REPLACE FUNCTION sneaker_update()
      RETURNS trigger AS
    $BODY$
    BEGIN
        new.clean_sku := REGEXP_REPLACE(new.sku, '[]\[!@#$%.&*`~^_{}:;<>/\|()-]+', ' ');
        RETURN new;
    END;
    $BODY$
    LANGUAGE plpgsql;
    

    Once you have the function, you can create insert and update triggers to maintain it:

    CREATE OR REPLACE TRIGGER sneakerinsert
      BEFORE INSERT
      ON sneakers
      FOR EACH ROW
      EXECUTE PROCEDURE sneaker_update();
    
    CREATE OR REPLACE TRIGGER sneakerupdate
      BEFORE UPDATE OF sku
      ON sneakers
      FOR EACH ROW
      EXECUTE PROCEDURE sneaker_update();
    

    Essentially, the conversion of the source data into a representation that performs well for querying only needs to be repeated when the source data changes.

    This pattern appears again and again in database programming: we put the data into a format that supports the performance needs of our application, and we use the "constraining" mechanisms of the DB engine to ensure data integrity. And then we write some documentation explaining why we did it 🙂
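    On PostgreSQL 12 or later, a stored generated column is a lighter-weight alternative to the trigger pair: the engine maintains the value itself, and no function or triggers are needed. A sketch, assuming the same regex as above (the `text` type here is an assumption, since the original question gives no schema):

    ```sql
    -- PostgreSQL 12+: the engine keeps clean_sku in sync automatically.
    -- Works because REGEXP_REPLACE with constant arguments is immutable.
    ALTER TABLE sneakers
        ADD COLUMN clean_sku text
        GENERATED ALWAYS AS (REGEXP_REPLACE(sku, '[]\[!@#$%.&*`~^_{}:;<>/\|()-]+', ' ')) STORED;
    ```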

    From there it's as simple as

    SELECT  sneaker_id
    FROM    sneakers
    WHERE   clean_sku IN
    (
        SELECT  clean_sku
        FROM    sneakers
        GROUP   BY clean_sku
        HAVING  COUNT(*) > 1
    )
    
          
    
  3. This should be fairly efficient:

    select * 
    from (
      select 
        x.*,
        count(*) over (partition by skux) cnt
      from (
        SELECT
          sneaker_id,
          sku,
          REGEXP_REPLACE(sku,'[]\[!@#$%.&*`~^_{}:;<>/\|()-]+', ' ') skux
        from sneakers 
      ) x
    ) x
    where cnt > 1
    

    It accesses the table only once and performs the conversion only once.

    Db fiddle

    If this still doesn’t meet your performance needs, you might create a functional (expression) index on the normalized SKU.

    Or keep it as a normal column in the table.
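    Such an expression index could look like this (the index name is illustrative; note that the planner will only use the index when a query repeats the exact same expression):

    ```sql
    -- Expression index on the normalized SKU (PostgreSQL).
    CREATE INDEX sneakers_norm_sku_idx
        ON sneakers (REGEXP_REPLACE(sku, '[]\[!@#$%.&*`~^_{}:;<>/\|()-]+', ' '));
    ```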
