I need a regex to find all occurrences (there could be several) of an a tag with the text "Graphic source" and transform each one into an img tag whose src attribute contains the href URL.
So FROM
<small><a href="https://www.url.com/image.png" target="_blank" rel="noopener">Graphic source</a></small>
TO
<img src="https://www.url.com/image.png"/>
So for example:
Some text
Other tag <b>test</b>
<small><a href="https://www.url.com/name1.png" target="_blank" rel="noopener">Graphic source</a></small>test
<small><a href="https://www.url.com/name2.jpg" target="_blank" rel="noopener">Graphic source</a></small>Text text<small><a href="www.url.com">Do not transform</a></small>
This needs to be transformed into:
Some text
Other tag <b>test</b>
<img src="https://www.url.com/name1.png"/>test
<img src="https://www.url.com/name2.jpg"/>Text text<small><a href="www.url.com">Do not transform</a></small>
I almost got it working:
<small.*?href="(.*?)"
I don’t understand how to exclude a tags that do not contain the words "Graphic source" as their text, or how to drop all the other attributes of the a tag when it is transformed into an img tag.
4 Answers
Obligatory disclaimer: Stop Parsing (X)HTML with Regular Expression
https://regex101.com/r/2Wd9le/1
This should do the job:
For your replacement you could do:
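One way this could look in Python is sketched below. The pattern, the names GRAPHIC_SOURCE and to_img, and the use of re.sub are assumptions made for illustration, not necessarily what the linked regex101 demo uses. It assumes the markup always has the shape shown in the question, with double-quoted attributes.

```python
import re

# Hypothetical pattern written for this sketch.
# [^>]* keeps each piece inside a single tag, and the literal
# "Graphic source" text means other links are left alone.
GRAPHIC_SOURCE = re.compile(
    r'<small>\s*<a\s[^>]*?href="([^"]*)"[^>]*>\s*Graphic source\s*</a>\s*</small>'
)


def to_img(html_text: str) -> str:
    """Replace each <small><a ...>Graphic source</a></small> with an <img>."""
    # \1 is the URL captured from the href attribute.
    return GRAPHIC_SOURCE.sub(r'<img src="\1"/>', html_text)


sample = (
    'Some text\n'
    'Other tag <b>test</b>\n'
    '<small><a href="https://www.url.com/name1.png" target="_blank" '
    'rel="noopener">Graphic source</a></small>test\n'
    '<small><a href="https://www.url.com/name2.jpg" target="_blank" '
    'rel="noopener">Graphic source</a></small>Text text'
    '<small><a href="www.url.com">Do not transform</a></small>'
)

print(to_img(sample))
```

Only the captured URL is written into the replacement, which is why target and rel do not appear in the img tag.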
Don’t use regex to parse HTML/XML.
Better use a programming language and proper libraries to parse HTML.
With one of the most used languages, Python:
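A rough sketch of the parser-based idea, using only the standard library's html.parser so that it runs without extra packages (a third-party parser would likely be shorter). The class GraphicSourceRewriter and the function rewrite are hypothetical names. The sketch assumes the small element contains nothing but the link, it re-emits end tags in lowercase, and it does not handle processing instructions.

```python
import html
from html.parser import HTMLParser


class GraphicSourceRewriter(HTMLParser):
    """Re-emit the input, swapping matching <small><a> blocks for <img>."""

    EXPECTED = ["a", "text", "/a", "/small"]  # events that must follow <small>

    def __init__(self):
        super().__init__(convert_charrefs=False)  # keep entities as written
        self.out = []     # finished output pieces
        self.buf = None   # raw pieces since an open <small>, or None
        self.seen = []    # event names recorded since that <small>
        self.href = None

    def _flush(self):
        # The candidate did not match: write it out unchanged.
        self.out.extend(self.buf)
        self.buf = None

    def _emit(self, raw, event):
        if self.buf is None:
            self.out.append(raw)
            return
        self.buf.append(raw)
        self.seen.append(event)
        if self.seen != self.EXPECTED[:len(self.seen)]:
            self._flush()
        elif len(self.seen) == len(self.EXPECTED):
            src = html.escape(self.href, quote=True)
            self.out.append(f'<img src="{src}"/>')
            self.buf = None

    def handle_starttag(self, tag, attrs):
        raw = self.get_starttag_text()
        if tag == "small":
            if self.buf is not None:
                self._flush()
            self.buf, self.seen, self.href = [raw], [], None
        elif tag == "a" and self.buf is not None:
            self.href = dict(attrs).get("href")
            self._emit(raw, "a" if self.href else "other")
        else:
            self._emit(raw, "other")

    def handle_startendtag(self, tag, attrs):
        self._emit(self.get_starttag_text(), "other")

    def handle_endtag(self, tag):
        self._emit(f"</{tag}>", "/" + tag)

    def handle_data(self, data):
        is_label = data.strip() == "Graphic source"
        self._emit(data, "text" if is_label else "other")

    def handle_entityref(self, name):
        self._emit(f"&{name};", "other")

    def handle_charref(self, name):
        self._emit(f"&#{name};", "other")

    def handle_comment(self, data):
        self._emit(f"<!--{data}-->", "other")

    def handle_decl(self, decl):
        self._emit(f"<!{decl}>", "other")

    def result(self):
        self.close()
        if self.buf is not None:
            self._flush()
        return "".join(self.out)


def rewrite(html_text: str) -> str:
    parser = GraphicSourceRewriter()
    parser.feed(html_text)
    return parser.result()


sample = (
    '<small><a href="https://www.url.com/name1.png" target="_blank" '
    'rel="noopener">Graphic source</a></small>test\n'
    '<small><a href="www.url.com">Do not transform</a></small>'
)
print(rewrite(sample))
```

Unlike a regex, this reads the href through the parser's attribute handling, so attribute order and extra whitespace inside the tag do not matter.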
Output
The regex pattern itself won’t replace any values; it simply matches.
You’ll need to use a program or programming language to perform the substitution.
Assert that the text following the > is "Graphic source<".
The substitution text would be,
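As an illustration of that assertion, here is a minimal Python sketch. The pattern and the variable names are assumptions, not the answer's exact expression. The lookahead (?=>Graphic source<) only succeeds when the tag's closing > is followed directly by the required text.

```python
import re

# Hypothetical pattern: the lookahead checks the link text before the
# rest of the element is consumed.
pattern = re.compile(
    r'<small><a\s[^>]*?href="([^"]*)"[^>]*(?=>Graphic source<)>[^<]*</a></small>'
)

# Python spells the first capture group \1; many other engines use $1.
substitution = r'<img src="\1"/>'

text = (
    '<small><a href="https://www.url.com/image.png" target="_blank" '
    'rel="noopener">Graphic source</a></small> and '
    '<small><a href="www.url.com">Do not transform</a></small>'
)

converted = pattern.sub(substitution, text)
print(converted)
```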
Also, I presume you could use \s* preceding and following the text.
In this type of situation, where there are repeated keys and values, you can use the lazy quantifier, ?, to match up to the first encountered quotation mark.
For example,
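The difference is easiest to see side by side. This is a small Python sketch with an assumed sample tag; the variable names are only for illustration.

```python
import re

tag = '<a href="https://www.url.com/image.png" target="_blank" rel="noopener">'

# Greedy: .* runs to the end and backtracks to the LAST quotation mark.
greedy = re.search(r'href="(.*)"', tag).group(1)

# Lazy: .*? stops at the FIRST quotation mark after href=".
lazy = re.search(r'href="(.*?)"', tag).group(1)

print(greedy)  # https://www.url.com/image.png" target="_blank" rel="noopener
print(lazy)    # https://www.url.com/image.png
```

The greedy version swallows the target and rel attributes, which is how unwanted attributes end up in the captured group.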
Here is an example output