
I have multiple strings and I need to remove repeated chars. For example: the string here abbbbbc x should become here abc x or the string test jjka should become test jka.

After studying, I came up with the code below which works fine (it uses PHP but you can use any language):

echo preg_replace('/([a-z])\1+/', '$1', 'test ajjjo new');

The code above will output test ajo new which is great!

My problem now is that I need to replace the repeated chars only if they are inside a word or at the beginning or end of the word, not when they make up the whole word. For example: I need the string here bbb cca to become here bbb ca and the string test hjjjja ppp to become test hja ppp. I tried negating the (space) and ^ and $ but it all becomes a mess pretty fast.

What would you recommend?

2

Answers


  1. Simpler solution, as I thought there ought to be, making use of the "best regex trick ever" (https://www.rexegg.com/regex-best-trick.html):

    \b(?<whole_word>[a-z])\k{whole_word}++\b(*SKIP)(*FAIL)|(?<not_whole_word>[a-z])\k{not_whole_word}++
    

    which is exactly the same as (but less compact than) what @Wiktor Stribiżew commented:

    \b([a-z])\1+\b(*SKIP)(*F)|([a-z])\2+
    

    and replace with:

    $not_whole_word
    

    See: https://regex101.com/r/pa0GjG/1


    Explanation:

    • \b if you find a whole word, i.e. starting at a word boundary
    • (?<whole_word>[a-z])\k{whole_word}++ followed by a character that, repeated, makes up the whole word until the
    • \b end of the word
    • (*SKIP)(*FAIL) then do not match
      • | in every other case
    • (?<not_whole_word>[a-z]) match a character that is
    • \k{not_whole_word}++ repeated
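The pattern above can be tried from PHP roughly as follows. This is a minimal sketch, not a definitive implementation: the function name collapse_inner_repeats is hypothetical, and it uses the numbered form of the pattern with $2 as the replacement, because preg_replace replacement strings take numbered references. The pattern is written in a single-quoted string so the backslashes reach the regex engine unchanged.

```php
<?php
// Hypothetical helper name; the pattern is the numbered form shown above.
function collapse_inner_repeats(string $text): string
{
    // Left branch: a word made up of one repeated letter is matched, then
    // (*SKIP)(*F) rejects it and keeps the engine from retrying inside it.
    // Right branch: any other run of a repeated letter, captured in group 2.
    $pattern = '/\b([a-z])\1+\b(*SKIP)(*F)|([a-z])\2+/';

    // Each matched run is replaced by a single copy of its letter.
    // preg_replace returns null on error, so fall back to the input.
    return preg_replace($pattern, '$2', $text) ?? $text;
}

echo collapse_inner_repeats('here bbb cca'), "\n";    // here bbb ca
echo collapse_inner_repeats('test hjjjja ppp'), "\n"; // test hja ppp
```

Only the right-hand branch can ever produce a match, so group 2 is always set when a replacement happens.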

    OLD IDEA


    You could use:

    (?:(\b)|\B)(?!\k{char})(?<anything>.)(?<char>[a-z])\k{char}++(?(1)\B)
    
    

    and replace with

    $anything$char
    

    See: https://regex101.com/r/yCNKY1/1

    I guess there is a more obvious answer but this should work also.


    • (?:(\b)|\B) check whether you are at the beginning of a word or not. If so, group 1 will be set.
      • (?!\k{char}) check that the character of interest is not preceded by itself
      • (?<anything>.) i.e. it must be preceded by some other character
        • (?<char>[a-z]) match the character
        • \k{char}++ match all of its repetitions and do not give them up
    • (?(1)\B) ensure that, if the start of the match was the start of a word, you are now not at the end -> you cannot match a complete word.
  2. You can replace each match of the regular expression

    (?<!\S)([a-z])\1*(?!\S)\K|([a-z])(?=\2)
    

    with an empty string.

    Demo

    Though I don’t know PHP I gather that some or all of the backslashes may need to be doubled.


    The regular expression can be broken down as follows.

    (?<!\S)  # negative lookbehind asserts that the following character is not
             # preceded by a character other than a whitespace
    ([a-z])  # match a lowercase letter and save it to capture group 1
    \1*      # match zero or more instances of the character in capture group 1
    (?!\S)   # negative lookahead asserts that the following character is not a
             # character other than a whitespace
    \K       # reset the start of the match and discard all previously-consumed
             # characters
    |        # or
    ([a-z])  # match a lowercase letter and save it to capture group 2
    (?=\2)   # positive lookahead asserts that the following character equals the
             # contents of capture group 2
    

    Note that (?<!\S) could be replaced with (?<=^|\s) and (?!\S) with (?=\s|$), but I understand that what I have is the more efficient of the alternatives.

    You could also hover the cursor over each part of the regex at the regex101.com link to obtain an explanation of its function.


    If, for example, the string were

    "aaa abbb bba abbba ababbbab"
          ^^  ^    ^^      ^^
    

    "aaa" would match (?<!\S)([a-z])\1*(?!\S), but then \K would cause those three characters to be discarded from the match that is returned, and would reset the start of the match to the location immediately before the space that follows "aaa". The first alternative ends there, so the resulting zero-width match is replaced with an empty string, which leaves that part of the string unchanged.

    Each of the single characters marked above with a caret (^) is matched by the second part of the alternation (([a-z])(?=\2)) and is therefore replaced with an empty string.
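For PHP readers, the walk-through above can be reproduced with a short sketch like the one below. The function name strip_inner_repeats is hypothetical. The pattern sits in a single-quoted string, which is an assumption about how one would embed it; in that form the backslashes do not need to be doubled.

```php
<?php
// Hypothetical helper name; the pattern is the one from this answer.
function strip_inner_repeats(string $text): string
{
    // First branch: a word consisting of one repeated letter is consumed and
    // then dropped from the match by \K, leaving a zero-width match.
    // Second branch: a letter that is immediately followed by itself.
    $pattern = '/(?<!\S)([a-z])\1*(?!\S)\K|([a-z])(?=\2)/';

    // Every match is replaced with an empty string.
    // preg_replace returns null on error, so fall back to the input.
    return preg_replace($pattern, '', $text) ?? $text;
}

echo strip_inner_repeats('aaa abbb bba abbba ababbbab'), "\n"; // aaa ab ba aba ababab
echo strip_inner_repeats('here bbb cca'), "\n";                // here bbb ca
```

Unlike a pattern that matches a whole run and substitutes a capture group, this one deletes the surplus letters one at a time, so the replacement string is simply empty.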
