
I’m trying to write a regular expression that picks out valid emails from anywhere in a string of text. My current regex works fine for most cases, but the overall length limit of 254 chars (applied using a negative lookahead) stops working when the email is enclosed in brackets (or followed by other characters, e.g. an ellipsis).

Is there a way to anchor/limit the lookahead so that it only counts characters captured by a specific group? Or is there some other solution?

My current regex is:

\b((?!\S{255,})[\w.'#%+-]{1,64}@(?:(?=.{1,63}\.)[a-z0-9](?:[a-zA-Z\d.-]*[a-z0-9])?\.)+[a-zA-Z]{2,})

Example below, using an email that hits the maximum chars (254 in my case). The first email (without brackets) gives a match, but the next email (with the brackets) does not match (since the closing bracket is included in the char count). I’d like this example string to result in three matches.

My email is: averylongaddresspartthatalmostwillreachthelimitofcharsperaddress@nowwejustneedaverylongdomainpartthatwill.reachthetotallengthlimitforthewholeemailaddress.whichis254charsaccordingtothePHPvalidate-email-filter.extendingthetestlongeruntilwereachtheright.com

You can contact me by email (averylongaddresspartthatalmostwillreachthelimitofcharsperaddress@nowwejustneedaverylongdomainpartthatwill.reachthetotallengthlimitforthewholeemailaddress.whichis254charsaccordingtothePHPvalidate-email-filter.extendingthetestlongeruntilwereachtheright.com)

This also won't match: averylongaddresspartthatalmostwillreachthelimitofcharsperaddress@nowwejustneedaverylongdomainpartthatwill.reachthetotallengthlimitforthewholeemailaddress.whichis254charsaccordingtothePHPvalidate-email-filter.extendingthetestlongeruntilwereachtheright.com...

This email is too long averylongaddresspartthatalmostwillreachthelimitofcharsperaddress@nowwejustneedaverylongdomainpartthatwill.reachthetotallengthlimitforthewholeemailaddress.whichis254charsaccordingtothePHPvalidate-email-filter.extendingthetestlongeruntilwereachthewronglength.com (so it should not result in a match)
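For reference, the behaviour can be reproduced with a short script. This is only a minimal sketch using Python's re module (an assumption, since the question does not name the engine), with the pattern's escapes written out. The helper make_email and the sample sentences are hypothetical test data, not the addresses above.

```python
import re

# The question's pattern as a raw string, with its backslash escapes written out.
ORIGINAL = re.compile(
    r"\b((?!\S{255,})[\w.'#%+-]{1,64}@"
    r"(?:(?=.{1,63}\.)[a-z0-9](?:[a-zA-Z\d.-]*[a-z0-9])?\.)+[a-zA-Z]{2,})"
)


def make_email(total_length):
    """Build a synthetic address of exactly total_length characters (hypothetical helper)."""
    local = "a" * 64
    labels = ".".join(["b" * 61] * 3) + "."  # three 61-char labels, each followed by a dot
    tld = "c" * (total_length - len(local) - 1 - len(labels))
    return f"{local}@{labels}{tld}"


email = make_email(254)  # exactly at the length limit

samples = {
    "bare": f"My email is: {email}",
    "brackets": f"Contact me ({email})",
    "ellipsis": f"Also: {email}...",
}

# Collect the full match text for every sample.
found = {
    name: [m.group(0) for m in ORIGINAL.finditer(text)]
    for name, text in samples.items()
}

for name, hits in found.items():
    print(name, len(hits))
```

Under these assumptions only the bare sample should produce a match: in the other two, the trailing ")" or "..." is non-whitespace, so \S{255,} succeeds and the negative lookahead rejects the address.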

2 Answers


  1. Is there a way to anchor a negative lookahead to a particular group?

    No. A lookahead is independent of capture groups. The best you can do is restrict it through its own internal pattern.

    That question, however, is only loosely related to the problem you describe.

    Your lookahead also counts the closing parenthesis. It shouldn’t use \S, which matches far more characters than your pattern allows.

    Use (?![\w.@'#%+-]{255,}) instead, so that the length check counts only characters that the pattern itself allows.
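A quick way to check the effect of that change is sketched below. It assumes Python's re module, and the helper make_email plus the sample sentences are hypothetical test data rather than anything from the question.

```python
import re

# The question's pattern with only the length lookahead swapped for the
# character class suggested above (escapes written out, raw strings).
FIXED = re.compile(
    r"\b((?![\w.@'#%+-]{255,})[\w.'#%+-]{1,64}@"
    r"(?:(?=.{1,63}\.)[a-z0-9](?:[a-zA-Z\d.-]*[a-z0-9])?\.)+[a-zA-Z]{2,})"
)


def make_email(total_length):
    """Build a synthetic address of exactly total_length characters (hypothetical helper)."""
    local = "a" * 64
    labels = ".".join(["b" * 61] * 3) + "."  # three 61-char labels, each followed by a dot
    tld = "c" * (total_length - len(local) - 1 - len(labels))
    return f"{local}@{labels}{tld}"


ok = make_email(254)        # exactly at the limit
too_long = make_email(255)  # one character over

samples = {
    "bare": f"My email is: {ok}",
    "brackets": f"Contact me ({ok})",
    "ellipsis": f"Also: {ok}...",
    "too_long": f"Too long: {too_long}",
}

# Collect the full match text for every sample.
found = {
    name: [m.group(0) for m in FIXED.finditer(text)]
    for name, text in samples.items()
}

for name, hits in found.items():
    print(name, len(hits))
```

With this sketch the bracketed address should now match, because ")" is not in the class and no longer counts toward the length. The ellipsis sample should still fail, since "." is in the class and the three trailing dots push the run to 257 characters.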


  2. To do the trick:

    • remove the negative lookahead that checks the length.
    • put the full pattern in a lookahead (without the leading word boundary).
    • in the same lookahead, at the end, add a capture group that captures everything up to the end of the line.
    • after the lookahead, write for example \S{3,254} (the allowed length), then use a backreference in a second lookahead to check that the rest of the line is the same as the text you captured.

    result:

    /\b(?=\w[\w.'#%+-]{0,63}@(?:(?=[^.\s]{1,63}\.)[a-z0-9](?:[a-zA-Z\d.-]*[a-z0-9])?\.)+[a-zA-Z]{2,}(.*))\S{3,254}(?=\1$)/gm
    


    This works because lookaheads are atomic: for a given starting position, once the closing parenthesis of a lookahead is reached, the engine can no longer backtrack into it, so the contents of the capture groups inside it can’t change.
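The pattern can be tried out with a small script. The sketch below assumes Python's re module, with re.MULTILINE standing in for the m flag and finditer for the g flag. The helper make_email and the four sample lines are hypothetical test data modelled on the question's example.

```python
import re

# The pattern from this answer, escapes written out. Group 1, captured inside
# the first lookahead, holds the rest of the line after the address.
PATTERN = re.compile(
    r"\b(?=\w[\w.'#%+-]{0,63}@"
    r"(?:(?=[^.\s]{1,63}\.)[a-z0-9](?:[a-zA-Z\d.-]*[a-z0-9])?\.)+"
    r"[a-zA-Z]{2,}(.*))"
    r"\S{3,254}(?=\1$)",
    re.MULTILINE,
)


def make_email(total_length):
    """Build a synthetic address of exactly total_length characters (hypothetical helper)."""
    local = "a" * 64
    labels = ".".join(["b" * 61] * 3) + "."  # three 61-char labels, each followed by a dot
    tld = "c" * (total_length - len(local) - 1 - len(labels))
    return f"{local}@{labels}{tld}"


ok = make_email(254)        # exactly at the limit
too_long = make_email(255)  # one character over

text = "\n".join([
    f"My email is: {ok}",
    f"You can contact me by email ({ok})",
    f"Trailing ellipsis: {ok}...",
    f"This email is too long {too_long} (so it should not match)",
])

# The overall match is the \S{3,254} run, i.e. the address itself.
matches = [m.group(0) for m in PATTERN.finditer(text)]
print(len(matches), [len(m) for m in matches])
```

Under these assumptions the first three lines should each yield the 254-character address and the fourth should yield nothing. For the over-long address, \S{3,254} can never end where the captured tail begins, and the engine cannot re-enter the lookahead to try a shorter address.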
