I am trying to achieve this, and I refuse to believe it can't be done with regex, although I would probably have done it with scripting already …
I have a bunch of text files containing multi-line blocks of text that sit between lines starting with a dash (as the first non-blank character on that line). Each such block might have, as its last line before the next dash-starting line, a line that starts with ">" and contains a hashtag somewhere on that line.
Here is one such example:
- ffff
- aaa
* bbb
* ccc
- tttt
* aaa
* bbb
> #tag
- tttt
* aaaa
* bbbb
> #log
-
I have managed to get the text in between dashes, but I can't seem to be able, for the life of me, to specify another rule: that in between those dashes there should be a line starting with ">" and containing the tag "#log". For example:
(^\s*)(-[\s\S\n]+?(?=^\s*-))
Is there a way, with regex, to match each multi-line text in between dash-starting lines that contains, on its last line, ">.*#log.*$"?
I tried many, many variations like this one:
(^\s*)(-[\s\S\n]+?(?=>.*#log.*$\n\s*-))
This basically gets everything (greedy?) from the first dash in the file up to the line starting with ">" that is followed by a line starting with a dash. So it does what it says it does; I just don't know how to tell it that I don't want any lines starting with a dash in the match.
I somehow want to tell it to only consider the last dash-starting line (non-greedy?)…
Thank you for saving my sanity 🙂
Edit: tried to add this in a comment but it is very quirky:
expected output is:
- whatever text on \n
any number of lines which might have "-" dashes in there, \n
* lists \n
* sublist \n
etc \n
also some #tags and the whole "block" ends with the following line \n
> #tag1 #tag2 .... #log ... [maybe some links at the end, etc] \n
So basically everything from a dash up to the line that starts with ">", with no intermediary lines that start with a dash (a dash might appear inside a line, though).
I've put "\n" at the end of the lines because a code block is not possible.
3 Answers
I think you want to filter out this area of your input text:
For this, the regex should be:
https://regex101.com/r/WSVznR/2
If you want to filter out the data between the hyphen (-) and #log, then your regex should be: https://regex101.com/r/SJuUK2/2
If it contains some other special characters, add them accordingly.
Here, I've used the regex lookahead and lookbehind concepts.
Thank you @RicardoSouza for your comment, which describes a really valid scenario that I had completely missed before.
So, I have updated my answer accordingly.
If I understood your problem correctly, you want to select the text from lines starting with a dash - to the closest line that starts with a greater-than sign > and contains the hashtag #log anywhere in that line.
If that's the case, a possible solution is the one in the regex101 link below.
This matches a line starting with a -, then matches any following lines starting with anything that is not - or >, until it finds a line starting with > that contains #log in it.
https://regex101.com/r/e95h9x/1
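A minimal Python sketch of the approach described above, in case it helps to see it spelled out. The pattern is my own reconstruction from that description and is an assumption, not necessarily the regex behind the link; the names BLOCK_RE, SAMPLE and find_log_blocks are hypothetical. Note that it also treats something like "#logging" as a hit, since it only looks for the literal text "#log".

```python
import re

# Assumed reconstruction of the described approach (not the linked regex itself).
BLOCK_RE = re.compile(
    r"^[ \t]*-.*\n"             # a line whose first non-blank character is a dash
    r"(?:(?![ \t]*[->]).*\n)*"  # following lines that start with neither '-' nor '>'
    r"[ \t]*>.*#log.*$",        # closing line: starts with '>' and contains '#log'
    re.MULTILINE,
)

# The sample input from the question.
SAMPLE = """\
- ffff
- aaa
* bbb
* ccc
- tttt
* aaa
* bbb
> #tag
- tttt
* aaaa
* bbbb
> #log
-
"""


def find_log_blocks(text):
    """Return every dash-started block that ends in a '>' line containing '#log'."""
    return [m.group(0) for m in BLOCK_RE.finditer(text)]


for block in find_log_blocks(SAMPLE):
    print(block)
```

The negative lookahead is checked once per line, so a block can never swallow another dash-starting line; that is the part the lazy quantifier alone could not express.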
This should take care of it. I tested it in VSCode against your sample input and got two matches: the first chunk from * aaa to > #tag, and the second chunk from * aaaa to > #log.
(?<=-.+\n)(?:[^-#]|\n)+#.+
This uses a lookbehind to find a line starting with a hyphen without including it, followed by a run of any characters that are not # or -, including newlines, up until it gets to a #, and then anything else out to the end of that line.
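For anyone trying this outside VSCode: Python's built-in re module rejects variable-length lookbehinds such as (?<=-.+\n), so the pattern cannot be used there as written. A rough equivalent, sketched below, consumes the dash line instead of looking behind it and captures the rest in a group. The names CHUNK_RE, SAMPLE, chunks and log_chunks are hypothetical, and the final "#log" filter is my own addition based on what the question asks for, not part of the regex above.

```python
import re

# Rough equivalent of (?<=-.+\n)(?:[^-#]|\n)+#.+ for Python's `re`:
# the dash line is consumed rather than looked behind, and the chunk is captured.
CHUNK_RE = re.compile(r"-.+\n((?:[^-#]|\n)+#.+)")

# The sample input from the question.
SAMPLE = """\
- ffff
- aaa
* bbb
* ccc
- tttt
* aaa
* bbb
> #tag
- tttt
* aaaa
* bbbb
> #log
-
"""

# With a single capturing group, findall returns just the captured chunks.
chunks = CHUNK_RE.findall(SAMPLE)

# Extra step (an assumption about the goal): keep only chunks whose
# last line contains '#log'.
log_chunks = [c for c in chunks if "#log" in c.splitlines()[-1]]

print(log_chunks)
```

Because [^-#] excludes every dash and hash, not only those at the start of a line, a block with a "-" or an earlier "#" anywhere inside it will not match with this approach.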