
I have a list of n rows (n < 100) with '#'-separated values in Postgres (version 13+):

row  value
1   'ABC#DEF#123'
2   'ABC#DEF#156'
3   'ABC#DEF#178'
...

These values can be of any length (e.g. 'ABC#DEFDEF#GHIG#123123#456#678').
There are no NULL values.

Now imagine you split these strings on the '#' separator. I would like to get the maximum common prefix (whole items only), i.e. the concatenation of the items that match FROM THE START, joined back with '#'. So in that case the returned row would be:

row  output
1    'ABC#DEF' (or 'ABC#DEF#')

but only counting FULL substrings between separators. For instance:

row  value
1   'ABC#123'
2   'ADE#456'

should return

row  output
1    ''

since 'ABC' and 'ADE' do not match, this should not return 'A'.

We also only want items to match from the beginning, e.g.:

row  value
1   'ABC#DEF#123'
2   'ADE#DEF#123'

should return

row  output
1    ''

Currently I'm using:

CREATE OR REPLACE FUNCTION lcp_iterate(_state TEXT, value TEXT)
RETURNS TEXT
AS
$$
        -- first mismatching position: return everything before it ...
        SELECT  SUBSTRING($2, 1, s - 1)
        FROM    generate_series(1, LEAST(LENGTH($1), LENGTH($2))) s
        WHERE   SUBSTRING($1, 1, s) <> SUBSTRING($2, 1, s)
        UNION ALL
        -- ... else one string is a prefix of the other: keep the shorter one
        -- (on the first row the state is NULL and LEAST(NULL, $2) yields $2)
        SELECT  LEAST($1, $2)
        LIMIT 1;
$$
LANGUAGE sql;

DO $$ BEGIN
    CREATE AGGREGATE lcp(TEXT) (SFUNC = lcp_iterate, STYPE = TEXT);
EXCEPTION
    WHEN sqlstate '42723' THEN NULL;  -- duplicate_function: the aggregate already exists
END $$;

but it returns

row  output
1    'A'

It would be extremely simple to do in any programming language, but I'm struggling to find a way to do it in Postgres.

Any help appreciated!

PS: I'm trying to keep my codebase as readable as possible, so I'd like to avoid nesting 20 CASE WHEN THEN expressions inside each other (since I can't have more than 20 #s).

base SQL to try:

WITH source AS (
  SELECT 'ABC#DEF#123' AS value
  UNION ALL 
  SELECT 'ABC#DEF#456'
)

SELECT my_func(value)  -- the aggregate I'm looking for, across all rows
FROM source;

2 Answers


  1. One way is with a RECURSIVE CTE:

    WITH RECURSIVE cte AS (
       (
       SELECT id, value, 1 AS part, ''::text AS result
       FROM   tbl
       ORDER  BY id
       LIMIT  1
       )
    
       UNION ALL
       SELECT c.id, c.value, c.part + 1, concat_ws('#', NULLIF(c.result, ''), split_part(c.value, '#', c.part))
       FROM   cte c
       WHERE  split_part(c.value, '#', c.part) <> ''  -- stop once the end of the string is reached
       AND    NOT EXISTS (
          SELECT FROM tbl t
          WHERE  t.id > c.id
          AND    t.value <> concat_ws('#', NULLIF(c.result, ''), split_part(c.value, '#', c.part))                -- row disagrees: it is not the candidate itself ...
          AND    t.value NOT LIKE concat_ws('#', NULLIF(c.result, ''), split_part(c.value, '#', c.part)) || '#%'  -- ... and does not continue it at a '#' boundary
          )
       )
    SELECT result
    FROM   cte
    ORDER  BY part DESC
    LIMIT  1;
    

    fiddle

    Pick the first row.
    Then, for each '#'-delimited item in value, check whether any other row disagrees. Keep looping until the first mismatch or the end of the string.
    The last iteration yields the longest common prefix. (For a single row it’s the whole value.)

    The beauty: it breaks early, as soon as the first mismatch is found, and does not look at the rest, however long that may be.

    This assumes id is the PRIMARY KEY and value is defined NOT NULL; otherwise you need to declare how to deal with NULL values (and do more).
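
    For reference, a minimal test setup matching these assumptions (a sketch; table and column names as used in the query, data from the question):

    CREATE TABLE tbl (
       id    int PRIMARY KEY
     , value text NOT NULL
    );

    INSERT INTO tbl VALUES
       (1, 'ABC#DEF#123')
     , (2, 'ABC#DEF#156')
     , (3, 'ABC#DEF#178');

    -- with this data the recursive query above returns 'ABC#DEF'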

    I formulated the predicate with NOT LIKE to employ a trigram or COLLATE "C" index on (value). In Postgres 15 or later I would use the "starts with" operator ^@ instead.
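
    For illustration, such an index could look like this (a sketch; the index name is arbitrary, using the test table from above):

    CREATE INDEX tbl_value_c_idx ON tbl (value COLLATE "C");

    -- and the "starts with" operator for the same check:
    SELECT 'ABC#DEF#123' ^@ 'ABC#DEF';  -- true: the left operand starts with the right one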

    The function split_part() is instrumental here.
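
    A quick illustration of its behavior:

    SELECT split_part('ABC#DEF#123', '#', 2);  -- 'DEF'
    SELECT split_part('ABC#DEF#123', '#', 9);  -- '' (past the last field)

    That it returns an empty string past the last field is exactly what the recursion uses as its end-of-string check.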


  2. Unnesting your delimited values is still a good option (updated from my comment for your more complicated requirements).

    Consider:

    WITH source AS (
      SELECT 'ABC#ABC#DEF#123' AS value
      UNION ALL 
      SELECT 'ABC#ABC#DEF#456'
      UNION ALL
      SELECT 'ABC#ABC#DEF#456'
      UNION ALL
      SELECT 'ABC#ABC#DEF#456'
      UNION ALL 
      SELECT 'ABC#ABC#456'
    )
    , tokenization AS (
        SELECT
          token,
          idx,
          recordcount,
          count(token) AS tokencount
        FROM (SELECT count(*) OVER () AS recordcount, source.* FROM source) s  -- the derived table needs an alias
             , unnest(string_to_array(value, '#')) WITH ORDINALITY AS a(token, idx)
        GROUP BY token, idx, recordcount
        HAVING recordcount = count(token)  -- keep only tokens that every record shares at this position
    )
    SELECT coalesce(string_agg(token, '#' ORDER BY idx), '')  -- '' when nothing matches
    FROM  (SELECT token, idx, row_number() OVER (ORDER BY idx) AS rn
           FROM   tokenization) t
    WHERE idx = rn  -- keep only the gap-free run from position 1, so the result is a true prefix
    

    The tokenization CTE counts each token (each substring) at each index (its position within the original string) and compares that count to the total record count, keeping only the tokens that appear in every record at that position. The final SELECT then restricts the result to the gap-free run of indexes starting at position 1 (so only a true prefix survives) and string_agg()s the remaining tokens back together in index order, which gives you back the prefix that all records share.
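
    To see what the unnesting step produces, here is that building block in isolation:

    SELECT token, idx
    FROM   unnest(string_to_array('ABC#DEF#123', '#')) WITH ORDINALITY AS a(token, idx);

    --  token | idx
    -- -------+-----
    --  ABC   |   1
    --  DEF   |   2
    --  123   |   3

    With the sample rows above, the full query returns 'ABC#ABC'.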
