I have a list of n rows (n < 100) with '#'-separated values in Postgres (version 13+):
row value
1 'ABC#DEF#123'
2 'ABC#DEF#156'
3 'ABC#DEF#178'
...
These values can be of any length (e.g. 'ABC#DEFDEF#GHIG#123123#456#678'). There are no NULL values.
Now imagine you split these strings on the '#' separator: I would like the maximal common prefix of entire items only, i.e. the items that match FROM THE START, joined back together with '#'.
So in that case the returned row would be:
row output
1 'ABC#DEF' (or 'ABC#DEF#')
But only FULL substrings between separators count; for instance
row value
1 'ABC#123'
2 'ADE#456'
should return
row output
1 ''
since 'ABC' and 'ADE' do not match, and it should not return 'A'.
We also only want items to match from the beginning, e.g.
row value
1 'ABC#DEF#123'
2 'ADE#DEF#123'
should return
row output
1 ''
Currently I'm using:
CREATE OR REPLACE FUNCTION lcp_iterate(_state TEXT, value TEXT)
RETURNS TEXT
AS
$$
-- character-by-character longest common prefix of the accumulated state and the next value
SELECT SUBSTRING($2, 1, s - 1)
FROM generate_series(1, LEAST(LENGTH($1), LENGTH($2))) s
WHERE SUBSTRING($1, 1, s) <> SUBSTRING($2, 1, s)
UNION ALL
SELECT LEAST($1, $2)   -- no mismatch: the shorter string is the common prefix
LIMIT 1;
$$
LANGUAGE sql;
DO $$ BEGIN
CREATE AGGREGATE lcp(TEXT) (SFUNC = lcp_iterate, STYPE = TEXT);
EXCEPTION
WHEN sqlstate '42723' THEN NULL;  -- ignore "duplicate function" if the aggregate already exists
END $$;
But it returns:
row output
1 'A'
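For reference, the aggregate is presumably invoked along these lines (the call isn't shown in the question); with the second sample pair it indeed yields 'A', because the comparison is character-by-character rather than item-by-item:
WITH source(value) AS (
   VALUES ('ABC#123')
        , ('ADE#456')
)
SELECT lcp(value) AS output   -- 'A', where the desired output is ''
FROM   source;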
It would be extremely simple to do in any programming language, but I'm struggling to find a way to do it in Postgres.
Any help appreciated!
PS: I'm trying to keep my codebase as readable as possible, so I'd like to avoid nesting 20 CASE WHEN expressions inside each other (since I can't have more than 20 #s).
Base SQL to try:
WITH source AS (
  SELECT 'ABC#DEF#123' AS value
  UNION ALL
  SELECT 'ABC#DEF#456'
)
SELECT my_func(value)   -- desired output: 'ABC#DEF'
FROM source;
2 Answers
One way is with a RECURSIVE CTE: fiddle
Pick the first row. Then, for every substring in value, check if there is any other row that disagrees. Keep looping till the first mismatch or the end of the string. The last iteration yields the longest common prefix. (For a single row it's the whole value.)
The beauty: it breaks early, as soon as the first mismatch is found, and does not look at the rest, however long that may be.
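The fiddle isn't reproduced here, so what follows is only a sketch of such a recursive CTE, under a few assumptions: a table source(id, value), no empty items, and no LIKE wildcard characters (%, _, \) inside the values.
WITH RECURSIVE source(id, value) AS (   -- sample data; replace with your table
   VALUES (1, 'ABC#DEF#123')
        , (2, 'ABC#DEF#156')
        , (3, 'ABC#DEF#178')
)
, rcte AS (
   (  -- anchor: empty prefix, taken from an arbitrary "pilot" row
   SELECT ''::text AS prefix, value, 1 AS idx
   FROM   source
   ORDER  BY id
   LIMIT  1
   )
   UNION ALL
   -- extend the prefix by the pilot row's next item, but only while no other row disagrees
   SELECT r.prefix || split_part(r.value, '#', r.idx) || '#', r.value, r.idx + 1
   FROM   rcte r
   WHERE  split_part(r.value, '#', r.idx) <> ''   -- the pilot row still has items left
   AND    NOT EXISTS (
             SELECT FROM source s
             WHERE  s.value || '#' NOT LIKE r.prefix || split_part(r.value, '#', r.idx) || '#%'
          )
)
SELECT rtrim(prefix, '#') AS lcp   -- last iteration = longest common prefix; '' if none
FROM   rcte
ORDER  BY idx DESC
LIMIT  1;
The growing prefix keeps a trailing '#' while recursing so that only whole items can match ('ABC' does not match a row starting with 'ABCD#'), and the separator is trimmed off in the final SELECT.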
Assuming id is the PRIMARY KEY, and value is defined NOT NULL, or you need to declare how to deal with NULL values (and do more).
I formulated the predicate with NOT LIKE to employ a trigram or COLLATE "C" index on (value). In Postgres 15 or later I would use the "starts with" operator ^@ instead.
The function split_part() is instrumental here.
Related:
Algorithm for finding the longest prefix
Select rows which are not present in other table
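As an aside to the indexing note above, the two predicate styles compare as follows (literal values only; the appended '#' is what restricts the match to whole items):
-- NOT LIKE: works on any supported version and can use a trigram or COLLATE "C" index
SELECT 'ABC#DEF#123' || '#' NOT LIKE 'ABC#DEF' || '#%';    -- false, i.e. this row agrees with the prefix
-- Postgres 15+: the "starts with" operator instead
SELECT NOT ('ABC#DEF#123' || '#' ^@ ('ABC#DEF' || '#'));   -- false, same check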
Unnesting your delimited values is still a good option (updated from my comment for your more complicated requirements).
Consider:
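The query itself didn't survive here, so this is only a sketch along the lines described below, again assuming a table source(id, value); the final WHERE clause adds the "match from the start" restriction on top of the per-position count check.
WITH source(id, value) AS (   -- sample data; replace with your table
   VALUES (1, 'ABC#DEF#123')
        , (2, 'ABC#DEF#156')
        , (3, 'ABC#DEF#178')
)
, tokenization AS (           -- one row per (position, token) with the number of source rows carrying it
   SELECT t.idx, t.token, count(*) AS cnt
   FROM   source s
   CROSS  JOIN LATERAL unnest(string_to_array(s.value, '#')) WITH ORDINALITY AS t(token, idx)
   GROUP  BY t.idx, t.token
)
, common_positions AS (       -- positions at which every source row has the same token
   SELECT idx, token
   FROM   tokenization
   WHERE  cnt = (SELECT count(*) FROM source)
)
SELECT COALESCE(string_agg(token, '#' ORDER BY idx), '') AS lcp
FROM   common_positions c
WHERE  c.idx = (SELECT count(*)   -- keep only the unbroken run of common positions starting at 1
                FROM   common_positions c2
                WHERE  c2.idx <= c.idx);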
That tokenization CTE is going to check the count for each token (each substring) with respect to its index (its location in the original string) and compare that to the entire record count for the table. Once that check is performed, it will string_agg() your tokens back together in order of the index, which will give you back the part of the original strings that matches across all records.
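With the three sample rows this sketch yields 'ABC#DEF'; for the 'ABC#DEF#123' / 'ADE#DEF#123' pair it yields '', because the final filter discards positions 2 and 3 even though their tokens match: they don't form an unbroken run starting at position 1.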