Let’s say I have a file called courses.txt with contents like the sample below.
The file has sections (the course provider and the email I used) followed by various courses. Example: edX ([email protected]) and then the various course names, each preceded by a serial number.
udemy ([email protected])
"=========================="-
1) foo bar
2) java programming language
3) redis stephen grider
4) javascript
5) react with typescript
6) kotlin
7) Etherium and Solidity : the Complete Developer's Guide
8) reactive programming with spring
coursera ([email protected])
"==========================-"
1) python
2) typescript
3) java concurrency
4) C#
edX ([email protected])
"==========================-"
1) excel
2) scala
3) risk management
4) stock
5) oracle
6) mysql
7) java
==========================-
Question: I want to grep for a course, say "java". I want a match that shows me the particular line(s) of the match (example: "java") and the corresponding section name (say, "edX ([email protected])").
If I want to search for "java", what regex will give me the following matches (I use grep/perl on Windows):
udemy ([email protected])
2) java programming language
coursera ([email protected])
3) java concurrency
edX ([email protected])
7) java
I tried lookbehind/lookahead but couldn’t figure out how to print the course provider name with email and the course name.
Thoughts?
3 Answers
I won’t give you a complete solution, but you can start with this:
Some explanation:
-i makes it case insensitive
-E uses extended regular expressions
| is an example of those extended regular expressions and it means "OR": show the lines which contain ‘java’ OR ‘@’ (the latter being all the email addresses)
As a result, you get output with all the e-mail addresses and all the ‘java’ courses, with a catch: if a line with an e-mail address is followed by another line with an e-mail address, then there’s no ‘java’ course for the first address. Hence, you can now use Perl to remove the e-mail addresses where the next line is also an e-mail address.
Looking at the input data we can conclude that a section starts with a line which includes an email address.
Data for the section starts with a serial number.
Based on this information we can build a hash %sections with the line which includes the email as a key, and all lines starting with a serial number stored in an array under that key.
Once the hash is built, the code goes through all sections and looks for lines which include the search term; if the term is found, it outputs the section with the matching lines.
Note: to work on a real file, replace <DATA> with <>, then run as ./script.pl filename.dat
Output
If you process in paragraphs (chunks of text separated by blank lines), then in each paragraph it is fairly straightforward to match the needed pattern: the header (followed by a line with ='s) and a line with java in it.
(Tested on Linux; read on for Windows. Broken into lines for easier reading. See below for explanation of the pattern.)
At the end of the line with ='s I use .+? instead of the specific characters that follow the ='s in your input because your sample input isn’t consistent; it has both -" and "- in different paragraphs. Adjust as suitable.
Since this is on Windows, where you may have to use " delimiters for the one-liner (I don’t know what shell you use), you may need to replace the literal " inside the pattern with \x22 (hex for "), or your other favorite sequence.
Hopefully good for Windows (can’t test on Windows right now)
The -00 switch makes it read in paragraphs. With the /x modifier, spaces inside the pattern are ignored so we can use them to space things out for readability. With the /s modifier the . matches a newline as well. This is important for the middle .+? to match multiple lines, up to the one with java (surrounded by spaces).†
If you don’t mind having a script instead of a one-liner, which is what I recommend, then, for example
The <> operator reads files given on the command line, line by line, but the notion of a "line" is earlier set to a paragraph with local $/ = "\n\n". That local is there in case this is a part of a larger program where you don’t want to change the $/ variable for the whole program!
† Or, instead of using /s, which makes . match newlines, use a pattern for multiple lines
Or, if you need "..." on Windows, like
(Again, I can’t test on Windows right now.)
Note that now we don’t have to make all those .+ non-greedy with the added ? (.+?) like in the patterns with /s above, now that .+ stops at a newline, just as needed here.
Or, use the /s modifier dynamically, via extended patterns
Here (?s) "turns on" the /s modifier, which would be in effect until the end of the enclosing group (the rest of the pattern in this case), but (?-s) turns it off.