
I have a list of n rows (n < 100) with '#'-separated values in Postgres (version 13+):

row  value
1   'ABC#DEF#123'
2   'ABC#DEF#156'
3   'ABC#DEF#178'
...

These values can be of any length (e.g. 'ABC#DEFDEF#GHIG#123123#456#678').
There are no NULL values.

Now imagine you split these strings on the '#' separator. I would like to get the maximum common prefix (whole items only), i.e. the concatenation of the items that match FROM THE START, joined back with '#'. So in that case the returned row would be:

row  output
1    'ABC#DEF' (or 'ABC#DEF#')

but only counting FULL substrings between separators. For instance:

row  value
1   'ABC#123'
2   'ADE#456'

should return

row  output
1    ''

since 'ABC' and 'ADE' do not match, this should not return 'A'.

We also only want items to match from the beginning, e.g.:

row  value
1   'ABC#DEF#123'
2   'ADE#DEF#123'

should return

row  output
1    ''

Currently I'm using:

CREATE OR REPLACE FUNCTION lcp_iterate(_state TEXT, value TEXT)
RETURNS TEXT
AS
$$
        -- first mismatching position: return everything before it ...
        SELECT  SUBSTRING($2, 1, s - 1)
        FROM    generate_series(1, LEAST(LENGTH($1), LENGTH($2))) s
        WHERE   SUBSTRING($1, 1, s) <> SUBSTRING($2, 1, s)
        UNION ALL
        -- ... else one string is a prefix of the other: keep the shorter one
        -- (on the first row the state is NULL and LEAST(NULL, $2) yields $2)
        SELECT  LEAST($1, $2)
        LIMIT 1;
$$
LANGUAGE sql;

DO $$ BEGIN
    CREATE AGGREGATE lcp(TEXT) (SFUNC = lcp_iterate, STYPE = TEXT);
EXCEPTION
    WHEN sqlstate '42723' THEN NULL;  -- duplicate_function: the aggregate already exists
END $$;

but it returns

row  output
1    'A'

It would be extremely simple to do in any programming language, but I'm struggling to find a way to do it in Postgres.

Any help appreciated!

PS: I'm trying to keep my codebase as readable as possible, so I'd like to avoid nesting 20 CASE WHEN THEN expressions inside each other (since I can't have more than 20 #s).

base SQL to try:

WITH source AS (
  SELECT 'ABC#DEF#123' AS value
  UNION ALL 
  SELECT 'ABC#DEF#456'
)

SELECT my_func(value)  -- the aggregate I'm looking for, across all rows
FROM source;

2 Answers


  1. One way is with a RECURSIVE CTE:

    WITH RECURSIVE cte AS (
       (
       SELECT id, value, 1 AS part, ''::text AS result
       FROM   tbl
       ORDER  BY id
       LIMIT  1
       )
    
       UNION ALL
       SELECT c.id, c.value, c.part + 1, concat_ws('#', NULLIF(c.result, ''), split_part(c.value, '#', c.part))
       FROM   cte c
       WHERE  split_part(c.value, '#', c.part) <> ''  -- stop once the end of the string is reached
       AND    NOT EXISTS (
          SELECT FROM tbl t
          WHERE  t.id > c.id
          AND    t.value <> concat_ws('#', NULLIF(c.result, ''), split_part(c.value, '#', c.part))                -- row disagrees: it is not the candidate itself ...
          AND    t.value NOT LIKE concat_ws('#', NULLIF(c.result, ''), split_part(c.value, '#', c.part)) || '#%'  -- ... and does not continue it at a '#' boundary
          )
       )
    SELECT result
    FROM   cte
    ORDER  BY part DESC
    LIMIT  1;
    

    fiddle

    Pick the first row.
    Then, for each '#'-delimited item in value, check whether any other row disagrees. Keep looping until the first mismatch or the end of the string.
    The last iteration yields the longest common prefix. (For a single row it’s the whole value.)

    The beauty: it breaks early, as soon as the first mismatch is found, and does not look at the rest, however long that may be.

    This assumes id is the PRIMARY KEY and value is defined NOT NULL; otherwise you need to declare how to deal with NULL values (and do more).
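
    For reference, a minimal test setup matching these assumptions (a sketch; table and column names as used in the query, data from the question):

    CREATE TABLE tbl (
       id    int PRIMARY KEY
     , value text NOT NULL
    );

    INSERT INTO tbl VALUES
       (1, 'ABC#DEF#123')
     , (2, 'ABC#DEF#156')
     , (3, 'ABC#DEF#178');

    -- with this data the recursive query above returns 'ABC#DEF'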

    I formulated the predicate with NOT LIKE to employ a trigram or COLLATE "C" index on (value). In Postgres 15 or later I would use the "starts with" operator ^@ instead.
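
    For illustration, such an index could look like this (a sketch; the index name is arbitrary, using the test table from above):

    CREATE INDEX tbl_value_c_idx ON tbl (value COLLATE "C");

    -- and the "starts with" operator for the same check:
    SELECT 'ABC#DEF#123' ^@ 'ABC#DEF';  -- true: the left operand starts with the right one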

    The function split_part() is instrumental here.
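
    A quick illustration of its behavior:

    SELECT split_part('ABC#DEF#123', '#', 2);  -- 'DEF'
    SELECT split_part('ABC#DEF#123', '#', 9);  -- '' (past the last field)

    That it returns an empty string past the last field is exactly what the recursion uses as its end-of-string check.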


  2. Unnesting your delimited values is still a good option (updated from my comment for your more complicated requirements).

    Consider:

    WITH source AS (
      SELECT 'ABC#ABC#DEF#123' AS value
      UNION ALL 
      SELECT 'ABC#ABC#DEF#456'
      UNION ALL
      SELECT 'ABC#ABC#DEF#456'
      UNION ALL
      SELECT 'ABC#ABC#DEF#456'
      UNION ALL 
      SELECT 'ABC#ABC#456'
    )
    , tokenization AS (
        SELECT
          token,
          idx,
          recordcount,
          count(token) AS tokencount
        FROM (SELECT count(*) OVER () AS recordcount, source.* FROM source) s  -- the derived table needs an alias
             , unnest(string_to_array(value, '#')) WITH ORDINALITY AS a(token, idx)
        GROUP BY token, idx, recordcount
        HAVING recordcount = count(token)  -- keep only tokens that every record shares at this position
    )
    SELECT coalesce(string_agg(token, '#' ORDER BY idx), '')  -- '' when nothing matches
    FROM  (SELECT token, idx, row_number() OVER (ORDER BY idx) AS rn
           FROM   tokenization) t
    WHERE idx = rn  -- keep only the gap-free run from position 1, so the result is a true prefix
    

    The tokenization CTE counts each token (each substring) at each index (its position within the original string) and compares that count to the total record count, keeping only the tokens that appear in every record at that position. The final SELECT then restricts the result to the gap-free run of indexes starting at position 1 (so only a true prefix survives) and string_agg()s the remaining tokens back together in index order, which gives you back the prefix that all records share.
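
    To see what the unnesting step produces, here is that building block in isolation:

    SELECT token, idx
    FROM   unnest(string_to_array('ABC#DEF#123', '#')) WITH ORDINALITY AS a(token, idx);

    --  token | idx
    -- -------+-----
    --  ABC   |   1
    --  DEF   |   2
    --  123   |   3

    With the sample rows above, the full query returns 'ABC#ABC'.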
