I’m trying to create a regex to pick out valid emails from anywhere in a string of text. My current regex works fine for most cases, but the overall length limit of 254 characters (applied using a negative lookahead) stops working when the email is enclosed in brackets or followed by other non-space characters (e.g. an ellipsis).
Is there a way to anchor/limit the lookahead so that it only counts characters captured by a specific group? Or is there some other solution?
My current regex is:
\b((?!\S{255,})[\w.'#%+-]{1,64}@(?:(?=.{1,63}\.)[a-z0-9](?:[a-zA-Z\d.-]*[a-z0-9])?\.)+[a-zA-Z]{2,})
Example below, using an email that hits the maximum chars (254 in my case). The first email (without brackets) gives a match, but the next email (with the brackets) does not match (since the closing bracket is included in the char count). I’d like this example string to result in three matches.
My email is: averylongaddresspartthatalmostwillreachthelimitofcharsperaddress@nowwejustneedaverylongdomainpartthatwill.reachthetotallengthlimitforthewholeemailaddress.whichis254charsaccordingtothePHPvalidate-email-filter.extendingthetestlongeruntilwereachtheright.com
You can contact me by email (averylongaddresspartthatalmostwillreachthelimitofcharsperaddress@nowwejustneedaverylongdomainpartthatwill.reachthetotallengthlimitforthewholeemailaddress.whichis254charsaccordingtothePHPvalidate-email-filter.extendingthetestlongeruntilwereachtheright.com)
This also won't match: averylongaddresspartthatalmostwillreachthelimitofcharsperaddress@nowwejustneedaverylongdomainpartthatwill.reachthetotallengthlimitforthewholeemailaddress.whichis254charsaccordingtothePHPvalidate-email-filter.extendingthetestlongeruntilwereachtheright.com...
This email is too long averylongaddresspartthatalmostwillreachthelimitofcharsperaddress@nowwejustneedaverylongdomainpartthatwill.reachthetotallengthlimitforthewholeemailaddress.whichis254charsaccordingtothePHPvalidate-email-filter.extendingthetestlongeruntilwereachthewronglength.com (so it should not result in a match)
2 Answers
No. A lookahead is independent of capture groups. The best you can do is constrain it through its own internal pattern.
But that question is only loosely related to the problem you describe.
Your lookahead overmatches the closing parenthesis. It shouldn’t use \S, as that matches far more characters than your pattern allows. Use (?![\w.@'#%+-]{255,}) instead, so that the length is checked only against characters the pattern itself allows.
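To compare the two lookaheads side by side, here is a minimal Python sketch. It assumes the question's pattern reads as written in the code (with its backslashes in place) and that Python's re module treats these constructs the same way as the flavor used in the question. The names build, ORIGINAL and FIXED are made up for the illustration, and the addresses are synthetic ones built to exact lengths, not the question's samples.

```python
import re


def build(guard):
    """Assemble the question's pattern around a given lookahead character class."""
    return re.compile(
        r"\b((?!" + guard + r"{255,})[\w.'#%+-]{1,64}@"
        r"(?:(?=.{1,63}\.)[a-z0-9](?:[a-zA-Z\d.-]*[a-z0-9])?\.)+[a-zA-Z]{2,})"
    )


ORIGINAL = build(r"\S")           # counts any non-space character
FIXED = build(r"[\w.@'#%+-]")     # counts only characters the pattern can match

# Synthetic test data: 64 + 1 + 189 = 254 characters, and a 255-character variant.
local = "a" * 64
email = local + "@" + ".".join(["b" * 61] * 3) + ".com"
too_long = local + "@" + "b" * 62 + "." + ".".join(["b" * 61] * 2) + ".com"

print(len(email), len(too_long))
print(len(ORIGINAL.findall("(" + email + ")")))   # closing bracket is counted
print(len(FIXED.findall("(" + email + ")")))      # closing bracket is ignored
print(len(FIXED.findall("(" + too_long + ")")))   # still rejected as too long
```

Note that . is itself in the allowed set, so a trailing ellipsis is still counted by this lookahead: with this change alone, the question's third example (the address followed by ...) stays unmatched.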
To do the trick: use a lookahead to capture the end of the line after \S{3,254} (allowed length), and check using a reference in a second lookahead if the end of the line is the same as the one you have captured.
This works because lookaheads are atomic: for a given starting position, once the closing parenthesis of a lookahead is reached, backtracking into it is no longer possible, and the contents of the capture groups inside it can’t be changed.
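The answer's own pattern did not survive here, so the following Python sketch is only one way the idea might be written, not the answerer's regex. It assumes the question's email sub-pattern as written in the code, Python's re module, and hypothetical names (EMAIL_BODY, PATTERN, find_emails, and the groups tail and email). The guard (?![\w-]|\.\w) is an extra assumption of this sketch: without it, the engine could backtrack to a truncated address that happens to fit within the limit.

```python
import re

# Email sub-pattern as assumed from the question (no length lookahead).
EMAIL_BODY = (
    r"[\w.'#%+-]{1,64}@"
    r"(?:(?=.{1,63}\.)[a-z0-9](?:[a-zA-Z\d.-]*[a-z0-9])?\.)+[a-zA-Z]{2,}"
)

PATTERN = re.compile(
    # Greedily skip up to 254 non-space characters and capture the rest of
    # the line; the lookahead is atomic, so "tail" is fixed from here on.
    r"\b(?=\S{3,254}(?P<tail>.*))"
    r"(?P<email>" + EMAIL_BODY + r")"
    # Sketch-specific guard: the match must not stop in the middle of a domain.
    r"(?![\w-]|\.\w)"
    # The email fits only if what follows it still ends with the captured tail.
    r"(?=.*(?P=tail)$)",
    re.MULTILINE,
)


def find_emails(text):
    """Return every address in text that fits the 254-character limit."""
    return [m.group("email") for m in PATTERN.finditer(text)]


# Synthetic test data: a 254-character address and a 255-character one.
local = "a" * 64
email = local + "@" + ".".join(["b" * 61] * 3) + ".com"
too_long = local + "@" + "b" * 62 + "." + ".".join(["b" * 61] * 2) + ".com"

text = "\n".join([
    "My email is: " + email,
    "You can contact me by email (" + email + ")",
    "This also should match: " + email + "...",
    "This email is too long " + too_long + " (so it should not match)",
])
print(len(find_emails(text)))
```

The length limit here depends only on where the email ends relative to the captured tail, so a closing bracket or an ellipsis after the address does not count against it. re.MULTILINE is used so that $ refers to the end of each line in multi-line input.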