I have multiple strings and I need to remove repeated chars. For example, the string "here abbbbbc x" should become "here abc x", and the string "test jjka" should become "test jka".
After studying, I came up with the code below which works fine (it uses PHP but you can use any language):
echo preg_replace('/([a-z])\1+/', '$1', 'test ajjjo new');
The code above will output "test ajo new", which is great!
My problem now is that I need to replace the repeated chars only if they are inside a word or at the beginning or end of a word, not when they make up the whole word. For example, I need the string "here bbb cca" to become "here bbb ca", and the string "test hjjjja ppp" to become "test hja ppp". I tried negating the space, `^` and `$`, but it all becomes a mess pretty fast.
What would you recommend?
2 Answers
Simpler solution, as I thought there ought to be, making use of the "best regex trick ever" (https://www.rexegg.com/regex-best-trick.html):
which is exactly the same as (but less compact than) what @Wiktor Stribiżew commented:
and replace with:
See: https://regex101.com/r/pa0GjG/1
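For anyone who wants to try this from PHP, here is a minimal sketch. The pattern is reassembled from the pieces listed in this answer's explanation, with the backslashes written out. The replacement `$2` (the second capture group, `not_whole_word`) is an assumption, and the function name `collapseInnerRepeats` is made up for illustration.

```php
<?php
// Sketch only: the pattern is put together from the fragments explained in
// this answer. collapseInnerRepeats is a hypothetical helper name.
function collapseInnerRepeats(string $text): string
{
    // Branch 1: a whole word made of one repeated letter is skipped.
    // Branch 2: any other run of a repeated letter is matched.
    $pattern = '/\b(?<whole_word>[a-z])\k{whole_word}++\b(*SKIP)(*FAIL)'
             . '|(?<not_whole_word>[a-z])\k{not_whole_word}++/';

    // Assumption: keep a single copy of the repeated letter (group 2).
    // preg_replace returns null on error, so fall back to the input.
    return preg_replace($pattern, '$2', $text) ?? $text;
}

echo collapseInnerRepeats('here bbb cca'), "\n";     // here bbb ca
echo collapseInnerRepeats('test hjjjja ppp'), "\n";  // test hja ppp
```

The pattern is written in single quotes so that PHP passes the backslashes through to the regex engine unchanged.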
Explanation:
`\b` : if you find a whole word, i.e. starting at a word boundary
`(?<whole_word>[a-z])\k{whole_word}++` : followed by a character that, repeated, makes up the whole word until the
`\b` : end of the word
`(*SKIP)(*FAIL)` : then do not match
`|` : in every other case
`(?<not_whole_word>[a-z])` : match a character that is
`\k{not_whole_word}++` : repeated
OLD IDEA
You could use:
and replace with
See: https://regex101.com/r/yCNKY1/1
I guess there is a more obvious answer but this should work also.
`(?:(\b)|\B)` : check whether you are at the beginning of a word or not. If so, group 1 will be set.
`(?!\k{char})` : check that the character of interest is not preceded by itself
`(?<anything>.)` : i.e. it must be preceded by anything else
`(?<char>[a-z])` : match the character
`\k{char}++` : match any number of repetitions and do not give them up
`(?(1)\B)` : ensure that, if the start of the match was the start of a word, you are now not at the end, so you cannot match a complete word.
You can replace each match of the regular expression
with an empty string.
Demo
Though I don’t know PHP I gather that some or all of the backslashes may need to be doubled.
The regular expression can be broken down as follows.
Note that `(?<!\S)` could be replaced with `(?<=^|\s)` and `(?!\S)` replaced with `(?=\s|$)`, but I understand what I have is the more efficient of the alternatives.
You could also hover the cursor over each part of the regex at the regex101.com link to obtain an explanation of its function.
If, for example, the string were
"aaa" would match `(?<!\S)([a-z])\1*(?!\S)`, but then `\K` would cause those three characters to be discarded from the match that is returned, and would reset the start of the match to the location immediately before the space that follows "aaa". `|` terminates that (zero-width) match, which therefore is replaced with an empty string, resulting in no change to that part of the string.
Each of the single characters marked above with a caret (`^`) is matched by the second part of the alternation (`([a-z])(?=\2)`) and therefore is replaced with an empty string.
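A minimal PHP sketch of this approach follows. It assumes the regular expression is the one assembled from the fragments quoted in this answer, with their backslashes restored (`\S`, `\1`, `\K`, `\2`). The function name `dropInnerRepeats` is hypothetical.

```php
<?php
// Sketch only: the pattern is assembled from the fragments discussed in this
// answer. dropInnerRepeats is a hypothetical helper name.
function dropInnerRepeats(string $text): string
{
    // First alternative: a whole word made of a single repeated letter.
    // \K discards it from the reported match, leaving a zero-width match,
    // so replacing it with '' changes nothing.
    // Second alternative: a letter immediately followed by the same letter;
    // that letter is removed.
    $pattern = '/(?<!\S)([a-z])\1*(?!\S)\K|([a-z])(?=\2)/';

    // preg_replace returns null on error, so fall back to the input.
    return preg_replace($pattern, '', $text) ?? $text;
}

echo dropInnerRepeats('here bbb cca'), "\n";     // here bbb ca
echo dropInnerRepeats('test hjjjja ppp'), "\n";  // test hja ppp
```

With the pattern in a single-quoted PHP string, the backslashes should not need doubling, since only `\\` and `\'` are treated as escapes there.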