I am trying to build a feature that goes through messages between users and stores all U.S. phone numbers that may have been shared in a message. I want to be very loose about the phone numbers that I store. To do that, I came up with the following regex in PHP (explanation given below):
"/(?:\+?1[.\s-]*)?(?:\(?\d{1,3}\)[.\s-]*)?(?:\d{3}[.\s-]+)(?:\d{4}[.\s-]*)(?:(ext|ext.|Ext|Ext.|extension|Extension)?[.\s-]*\d{1,6})?|(?:\+?1?\d{10})/",
(?:\+?1[.\s-]*)?: This part handles an optional country code (+1) with an optional separator (dot, space, or hyphen). It's optional because I want to capture phone numbers without the country code as well.
(?:\(?\d{1,3}\)[.\s-]*)?: This part handles an optional area code enclosed in parentheses.
(?:\d{3}[.\s-]+): This part matches the first three digits of the phone number followed by a separator (can be '.', '-' or spaces).
(?:\d{4}[.\s-]*): This part matches the next four digits of the phone number followed by an optional separator (can be '.', '-' or spaces).
(?:(ext|ext.|Ext|Ext.|extension|Extension)?[.\s-]*\d{1,6})?: This part captures optional extensions (case-insensitive) with an optional separator and up to six digits.
|: This is an alternation operator, allowing the regular expression to match either the pattern before or after it.
(?:\+?1?\d{10}): This part handles an alternative pattern for phone numbers without explicit separators, where there could be an optional country code (+1) and 10 digits.
However, this regex also finds a match inside the following string:
+44 20 7123 4567
where 123 4567 is the match.
What should I use to avoid capturing this?
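A minimal reproduction of the problem, as a sketch: it assumes the pattern above with its backslash escapes written out and stored in a single-quoted PHP string (the variable names are only for illustration).

```php
<?php
// The pattern from the question, with backslash escapes written out.
$pattern = '/(?:\+?1[.\s-]*)?(?:\(?\d{1,3}\)[.\s-]*)?(?:\d{3}[.\s-]+)(?:\d{4}[.\s-]*)'
    . '(?:(ext|ext.|Ext|Ext.|extension|Extension)?[.\s-]*\d{1,6})?|(?:\+?1?\d{10})/';

// A UK number that should not be stored as a U.S. phone number.
$message = '+44 20 7123 4567';

// No attempt succeeds at the "+", so the engine moves forward and finally
// matches starting at the "1" inside "7123".
preg_match($pattern, $message, $m);

echo $m[0], PHP_EOL; // prints "123 4567"
```

The match starts in the middle of a digit run because nothing in the pattern requires a non-digit (or the start of the string) before the first digit.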
2 Answers
Not sure if this matches all your cases, but if you add
(?!\+\d{0,2}[^1])
at the beginning, you can ensure that the string doesn't start with a + symbol followed by up to 2 digits and a character other than 1.
It might be possible inside the regular expression, but why not just filter the result in PHP? Not everything has to be solved with a single regular expression.
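As a rough sketch of what filtering in PHP could look like: the code below runs the pattern from the question, then discards any match whose preceding text suggests it is the tail of a longer or non-US number. The function names `looksLikeTailOfOtherNumber` and `extractUsPhoneNumbers` and the two rejection rules are assumptions made for illustration, not something taken from the question.

```php
<?php
// Hypothetical helper: true when the text before a match suggests the match
// is only the tail of a longer number or of a number with a non-US prefix.
function looksLikeTailOfOtherNumber(string $text, int $offset): bool
{
    $before = substr($text, 0, $offset);

    // Rule 1 (assumption): a digit directly before the match means the
    // match starts in the middle of a longer digit run.
    if (preg_match('/\d\z/', $before) === 1) {
        return true;
    }

    // Rule 2 (assumption): a "+" followed by a code other than a lone "1",
    // with only digits and separators between it and the match.
    return preg_match('/\+(?!1\b)\d[\d.\s-]*\z/', $before) === 1;
}

// Hypothetical wrapper: collect all matches, keep those that pass the filter.
function extractUsPhoneNumbers(string $text, string $pattern): array
{
    // PREG_OFFSET_CAPTURE yields [matched text, byte offset] pairs.
    preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);

    $numbers = [];
    foreach ($matches[0] as [$value, $offset]) {
        if (!looksLikeTailOfOtherNumber($text, $offset)) {
            $numbers[] = trim($value);
        }
    }
    return $numbers;
}

// The pattern from the question, with backslash escapes written out.
$pattern = '/(?:\+?1[.\s-]*)?(?:\(?\d{1,3}\)[.\s-]*)?(?:\d{3}[.\s-]+)(?:\d{4}[.\s-]*)'
    . '(?:(ext|ext.|Ext|Ext.|extension|Extension)?[.\s-]*\d{1,6})?|(?:\+?1?\d{10})/';

$found = extractUsPhoneNumbers('UK: +44 20 7123 4567, US: (212) 555-1234', $pattern);
print_r($found); // only "(212) 555-1234" is kept
```

Keeping the rejection rules in plain PHP makes them easy to adjust, for example to allow or forbid additional prefixes, without touching the main pattern.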
One problem here is that a lookbehind assertion (aka "(not) prefixed by …") needs to have a fixed length, but a country code can have different lengths.
I would suggest matching any possible phone number. This would consume characters otherwise matched by partial matches. Then iterate over the matches and use a specific pattern to match a US phone number in any variant you require.
Note: In the following example I am using the x (Extended) modifier. This allows you to format, indent, and comment the pattern.