
I have this regex:

    $text = preg_replace_callback('/(\d+\.\d+|\b[A-Z](?:\.[A-Z])*\b\.?)|([.,;:!?)])\s*/', function ($matches) {
        return $matches[1] ? $matches[1] : $matches[2] . ' ';
    }, $text);

Which targets the end of sentences and avoids abbreviations like N.B.C.

It works fine. The problem is that it doesn’t detect three dots (...) or the ellipsis symbol (…) as the end of a sentence.

How can I adjust the regex to include it as well?

2 Answers


  1. You can modify your regular expression to include three dots (...) or the ellipsis symbol (…) by adding alternatives for them to the punctuation group of the pattern.

    Here’s the adjusted regex:

    $text = preg_replace_callback('/(\d+\.\d+|\b[A-Z](?:\.[A-Z])*\b\.?)|(\.{3}|…|[.,;:!?)])\s*/', function ($matches) {
        return $matches[1] ? $matches[1] : $matches[2] . ' ';
    }, $text);
    

    The second group now starts with the alternatives \.{3}|…, which match either three literal dots or the ellipsis character. They come before the character class so that three dots are matched as one unit instead of one dot at a time. Any of these matches is replaced with itself followed by a space, just like the other punctuation in your original pattern.
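
    As a quick check, the adjusted pattern can be run on a sample string. This is only a sketch: the function name addSpaceAfterPunctuation and the sample input are made up for illustration, and it assumes the source file is saved as UTF-8 so that the literal … in the pattern matches the one in the text.

    ```php
    <?php
    // Hypothetical wrapper around the adjusted regex (the name is an assumption).
    function addSpaceAfterPunctuation(string $text): string
    {
        return preg_replace_callback(
            '/(\d+\.\d+|\b[A-Z](?:\.[A-Z])*\b\.?)|(\.{3}|…|[.,;:!?)])\s*/',
            function ($matches) {
                // Group 1 (number or abbreviation) is returned unchanged;
                // otherwise group 2 (punctuation) gets exactly one trailing space.
                return $matches[1] ? $matches[1] : $matches[2] . ' ';
            },
            $text
        );
    }

    // Made-up sample input covering three dots, an abbreviation, a decimal and an ellipsis.
    echo addSpaceAfterPunctuation('Wait...what? See N.B.C. at 3.5…done');
    // prints: Wait... what? See N.B.C. at 3.5… done
    ```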

  2. If you want to add a space after punctuation, but not inside numbers or abbreviations, and for dots only after a single dot or exactly three dots, you could make use of (*SKIP)(*F) and \K.

    You can then call preg_replace with a single space as the replacement.

    \b(?:\d+(?:\.\d+)+\b|[A-Z](?:\.[A-Z])*\b\.?)(*SKIP)(*F)|(?:[,;:!?…]+(?=[^\s,;:!?…])|(?<!\.)(?:\.{3}|\.)(?=[^\s.]))\K
    

    The pattern matches:

    • \b A word boundary to prevent a partial match
    • (?: Non-capturing group for the alternatives
      • \d+(?:\.\d+)+\b Match 1+ digits, then repeat 1+ times a . followed by 1+ digits, followed by a word boundary
      • | Or
      • [A-Z](?:\.[A-Z])*\b\.? Match a single char A-Z, optionally repeat . and a char A-Z, then a word boundary and an optional dot
    • ) Close the non-capturing group
    • (*SKIP)(*F) Skip what was matched so far and fail, so numbers and abbreviations are left untouched
    • | Or
    • (?: Non-capturing group
      • [,;:!?…]+(?=[^\s,;:!?…]) Match 1+ of the listed characters, and assert that to the right there is a non-whitespace char that is not one of the listed characters
      • | Or
      • (?<!\.) Negative lookbehind, assert that there is no dot directly to the left
      • (?:\.{3}|\.) Match either 3 dots or 1 dot
      • (?=[^\s.]) Positive lookahead, assert a non-whitespace char to the right, except for a dot
    • ) Close the non-capturing group
    • \K Forget what is matched so far, so the match is empty and the replacement is inserted after the punctuation
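
    The two controls can also be tried in isolation on much smaller patterns. The sketch below is not part of the pattern above; the simplified patterns, variable names and sample strings are illustrative assumptions.

    ```php
    <?php
    // (*SKIP)(*F): the decimal number is matched first and then thrown away,
    // so the dot inside "1.5" is never offered to the second alternative.
    $skipDemo = preg_replace('/\d+\.\d+(*SKIP)(*F)|\./', '. ', 'v1.5.ok');
    echo $skipDemo, "\n"; // prints: v1.5. ok

    // \K: the comma is matched but dropped from the reported match, so the
    // replacement is inserted after the comma instead of replacing it.
    // The lookahead avoids adding a second space where one already exists.
    $keepDemo = preg_replace('/,\K(?=\S)/', ' ', 'a,b, c');
    echo $keepDemo, "\n"; // prints: a, b, c
    ```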


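    For example:

    // The u modifier makes PCRE read the pattern and subject as UTF-8, so … in the character class is one character.
    $pattern = '/\b(?:\d+(?:\.\d+)+|[A-Z](?:\.[A-Z])*\b\.?)(*SKIP)(*F)|(?:[,;:!?…]+(?=[^\s,;:!?…])|(?<!\.)(?:\.{3}|\.)(?=[^\s.]))\K/u';
    $text = preg_replace($pattern, " ", $string);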
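
    To see what this does on real input, here is a small self-contained run. The function name spaceAfterPunctuation and the sample string are made up for illustration. The u modifier is included on the assumption that the input is valid UTF-8; it makes … inside the character class count as a single character rather than three separate bytes.

    ```php
    <?php
    // Hypothetical helper (the name is an assumption) wrapping the pattern above.
    function spaceAfterPunctuation(string $string): string
    {
        $pattern = '/\b(?:\d+(?:\.\d+)+|[A-Z](?:\.[A-Z])*\b\.?)(*SKIP)(*F)'
                 . '|(?:[,;:!?…]+(?=[^\s,;:!?…])|(?<!\.)(?:\.{3}|\.)(?=[^\s.]))\K/u';

        // Every match is empty because of \K, so the space is inserted
        // right after the punctuation rather than replacing it.
        return preg_replace($pattern, ' ', $string);
    }

    // Made-up sample input: three dots, "?", ",", an abbreviation,
    // a decimal number, an ellipsis and a single dot.
    echo spaceAfterPunctuation('Wait...what?No,see N.B.C. at 3.5…done.End');
    // prints: Wait... what? No, see N.B.C. at 3.5… done. End
    ```

    Unlike the callback version, this one only inserts a space where none follows the punctuation, so text that is already spaced correctly is left as it is.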
    