Reading The Unix Programming Environment by Brian W. Kernighan and Rob Pike.
Working on an exercise, where one should find all palindromes in a wordlist using grep -E
.
The most concise solution I’ve been able to find online and understand would be
'^(.?)(.?)(.?)(.?)(.?).?54321$'
, which should find all palindromes up to 11 characters.
For some reason though, grep in my terminal gives me results that I can’t explain:
me@machine:~/test$ cat sample
a
ab
abba
abcdef
abcba
zufolo
me@machine:~/test$ grep -E '^(.?)(.?)(.?)(.?)(.?).?54321$' sample
a
abba
abcba
zufolo
me@machine:~/test$ grep -E '^(.?)(.?)(.?)(.?).?4321$' sample
a
abba
abcba
Why would it add zufolo
to the results (while not coloring any matches as it does with the actual palindromes)? Why does zufolo
disappear when I reduce the regex to up to 9 characters?
output
Does it have anything to do with my setup (the most standard one)?
me@machine:~/test$ ls -l /proc/$$/exe
lrwxrwxrwx 1 me me 0 jan 15 18:37 /proc/3528/exe -> /usr/bin/bash
me@machine:~/test$ cat /proc/version
Linux version 6.5.0-14-generic (buildd@lcy02-amd64-110) (x86_64-linux-gnu-gcc-12 (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #14~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Nov 20 18:15:30 UTC 2
P.S. zufolo
is from wamerican-insane
where I’m actually testing it. There are plenty of others like xenian
, tartar
.
I tried researching online, modifying regex, but can’t get my head around this one.
2
Answers
I guess the actual answer to your question is, there’s nothing wrong with your regex; you appear to have uncovered a bug in
grep
.Demo of the regex working: https://regex101.com/r/RueEm0/1
(Though keep in mind that regex101.com does not support POSIX ERE specifically. The demo uses PCRE2 which is what the site offers by default; but the regex does not use any PCRE features.)
Strangely enough, removing the
$
from the end of the regexp (i.e. making it less restrictive) produces fewer matches, which is the opposite of what it should do:a) With the
$
at the end of the regexp:b) Without the
$
at the end of the regexp:It’s not just GNU grep that behaves strangely, GNU sed has the same behavior from the question when just matching with
sed -nE '/.../p' sample
as GNUgrep
does AND sed behaves differently if we’re just doing a match vs if we’re doing a match + replace.For example here’s
sed
doing a match+replacement and behaving the same way asgrep
above:a) With the
$
at the end of the regexp:b) Without the
$
at the end of the regexp:but here’s sed just doing a match and behaving differently from any of the above:
a) With the
$
at the end of the regexp (note the extraab
in the output):b) Without the
$
at the end of the regexp (note the extraab
andabcdef
in the output):Also interestingly this:
outputs:
the last line of which means the regexp is apparently matching the start of the line and ignoring the
$
end-of-string metachar present in the regexp!The odd behavior isn’t just associated with using
-E
, though, if I remove-E
and just use POSIX compliant BREs then:a) With the
$
at the end of the regexp:b) Without the
$
at the end of the regexp:and again just doing a match in sed below behaves differently from the sed match+replacements above:
a) With the
$
at the end of the regexp:b) Without the
$
at the end of the regexp:The above shows that, given the same regexp, sed is apparently matching different strings depending on whether it’s doing a substitution or not.
So, there appear to be bugs in GNU grep and GNU sed without the
-E
. Or maybe the above is all just undefined behavior given those regexps and so any grep or sed can do whatever they like with them.