I would like to be able to return all canadian postalcode which can be find in using the following regex(migth have a space like: A0A 0A0 or A0A0A0)
regex [A-Z]d[A-Z]s*d[A-Z]d
The problem is when there is more than 1 match that a part of it it in .
I’ve played with capturing group (too much) and i’m more lost then at the start, oh well.
-
Given the string: SOMETEXTA0A 0A0OTHERSTUFF
- I would expect the following match
- A0A 0A0 (ok)
- I would expect the following match
-
Given the string: SOMETEXTB1A0A0A0OTHERSTUFFB2A 0B0OTHER
- I would expect the following match:
- B1A0A0 (ok)
- A0A0A0 (not found)
- B2A 0B0 (not found)
- I would expect the following match:
Here is the regex101 https://regex101.com/r/ZbrwnP/1
Here is my test setup for postgresql.
with barcodes as(
select *
from
(values
('AAAC9B92WLEDH0G5R0Z0|2024-01-01T01:59:39.379-05')
, ( 'SOMETEXTJ9H5G0L2J0OTHERSTUFF')
)b(barcode)
)
select *
from barcodes b
left join lateral (
SELECT UNNEST(
REGEXP_MATCHES(
b.barcode
, '[A-Z]d[A-Z]s*d[A-Z]d'
, 'g'
)
) match
)t on true
2
Answers
In your example
SOMETEXTB1A0A0A0OTHERSTUFFB2A 0B0OTHER
…B1A0A0
is found.A0A0A0
is not found because it overlaps withB1A0A0
.B2A 0B0
is found.Normally one deals with overlaps by using a capture group inside a positive look ahead like
(?=([A-Z]d[A-Z]s*d[A-Z]d))
. Here is a demonstration. Unfortunately in PostgreSQL…All you get is empty strings.
You can try installing
pgpcre
which gives you Perl Compatible Regular Expressions.This may be something better done by another programming language. Retrieve the data, parse it, and (if you need to do this more than once) insert the parsed postal codes back into the database.
You can achieve the result you want directly in SQL, although it requires a bit of work. Rather than attempting to match the entire regex, we match on just the first character of the regex, with a lookahead assertion to match the balance of the regex. This ensures we find all locations in the string which can match the regex, whether they overlap or not. So we initially use this regex:
We use this first in
regexp_count
, to see how many matches there are in the string, and usegenerate_series
to generate a sequence of numbers which represent each individual match. We then useregexp_instr
to find the starting position of each match in the string. Finally, we useregexp_substr
with the starting positions and the original regex to extract the matches:Output:
Note I’ve split out the steps individually to make it clear how this works, you could easily combine some of the steps to reduce the number of operations. You could also aggregate the results into an array if desired.
Demo on dbfiddle.uk