I need a regex to find all occurrences (there could be several) of an a tag with the text "Graphic source" and transform each one into an img tag whose src attribute contains the href URL.

So FROM

<small><a href="https://www.url.com/image.png" target="_blank" rel="noopener">Graphic source</a></small>

TO

<img src="https://www.url.com/image.png"/>

So for example:

Some text
Other tag <b>test</b>
<small><a href="https://www.url.com/name1.png" target="_blank" rel="noopener">Graphic source</a></small>test
<small><a href="https://www.url.com/name2.jpg" target="_blank" rel="noopener">Graphic source</a></small>Text text<small><a href="www.url.com">Do not transform</a></small>

Needs to be transformed as:

Some text
Other tag <b>test</b>
<img src="https://www.url.com/name1.png"/>test
<img src="https://www.url.com/name2.jpg"/>Text text<small><a href="www.url.com">Do not transform</a></small>

I almost got it working:
<small.*?href="(.*?)"

I don’t understand how to exclude a tags that do not contain the words Graphic source as text, or how to drop all the other attributes of the a tag when it is transformed into an img tag.

https://regex101.com/r/OReOCd/1

4 Answers


  1. Obligatory disclaimer: Stop Parsing (X)HTML with Regular Expression

    <small><a href="(.*?)"[^>]*?>Graphic source</a></small>
    

    https://regex101.com/r/2Wd9le/1

  2. This should do the job:

    <small><a href="(https?://[^"]+)"[^>]+>Graphic source</a></small>
    

    For your replacement you could do:

    <img src="$1"/>
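
As a rough sketch of how this pattern and replacement might be applied outside regex101, here is one way to do it with Python's re module. The function name links_to_images and the sample string are made up for illustration; note that Python writes the first capture group as \1 rather than $1.

```python
import re

# Pattern from the answer above: capture the href URL of an <a> whose
# text is exactly "Graphic source" and which is wrapped in <small>.
PATTERN = re.compile(
    r'<small><a href="(https?://[^"]+)"[^>]+>Graphic source</a></small>'
)


def links_to_images(html_text: str) -> str:
    """Replace each matching <small><a>...</a></small> with an <img> tag."""
    # Python's re module refers to the first capture group as \1, not $1.
    return PATTERN.sub(r'<img src="\1"/>', html_text)


# Illustrative input, copied from the example in the question.
sample = (
    'Some text\n'
    'Other tag <b>test</b>\n'
    '<small><a href="https://www.url.com/name1.png" target="_blank" rel="noopener">Graphic source</a></small>test\n'
    '<small><a href="https://www.url.com/name2.jpg" target="_blank" rel="noopener">Graphic source</a></small>'
    'Text text<small><a href="www.url.com">Do not transform</a></small>\n'
)

print(links_to_images(sample))
```

The "Do not transform" link is left alone because its text does not match.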
    
  3. Don’t use regex to parse HTML/XML

    It is better to use a programming language with a proper library to parse HTML.

    With one of the most widely used languages, Python:

    import requests
    from lxml import html

    # Fetch the sample page and parse it into an element tree.
    res = requests.get('https://sputnick.fr/downloads/regex-to-transform-link-tag-to-img-tag.html')
    tree = html.fromstring(res.text)

    # Select every <a> whose text is exactly "Graphic source" (XPath).
    elts = tree.xpath('//a[text()="Graphic source"]')

    # Swap each matching <a> for an <img> whose src is the link's href.
    for a_elt in elts:
        img_elt = html.Element("img", src=a_elt.get("href"))
        a_elt.getparent().replace(a_elt, img_elt)

    # Serialize the modified tree back to an HTML string.
    transformed_html = html.tostring(tree, encoding="unicode")

    print(transformed_html)
    

    Output (note that the small wrapper is kept around each new img tag)

    <html lang="en">
      <head>
        <title>Example</title>
        <meta charset="utf-8">
        <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1, user-scalable=0">
      </head>
      <body>
    Some text
    Other tag <b>test</b>
    <small><img src="https://www.url.com/name1.png"></small>test
    <small><img src="https://www.url.com/name2.jpg"></small>Text text<small><a href="www.url.com">Do not transform</a></small>
      </body>
    </html>
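
If the small wrapper should disappear as well, as in the output requested in the question, one possible variation is to replace the whole small element instead of only the link. This is a sketch rather than the answer's own code: it assumes lxml is installed, parses an inline fragment (wrapped in a made-up div parent) instead of fetching a page, and the variable names are illustrative.

```python
from lxml import html

# Illustrative input, copied from the example in the question.
fragment = (
    'Some text\n'
    'Other tag <b>test</b>\n'
    '<small><a href="https://www.url.com/name1.png" target="_blank" rel="noopener">Graphic source</a></small>test\n'
    '<small><a href="https://www.url.com/name2.jpg" target="_blank" rel="noopener">Graphic source</a></small>'
    'Text text<small><a href="www.url.com">Do not transform</a></small>\n'
)

# Wrap the fragment in a <div> so that it parses as a single tree.
root = html.fragment_fromstring(fragment, create_parent="div")

# Select each <small> that directly contains a "Graphic source" link.
for small in root.xpath('.//small[a[text()="Graphic source"]]'):
    link = small.find("a")
    img = html.Element("img", src=link.get("href"))
    # The text after </small> is stored as the element's tail; carry it over.
    img.tail = small.tail
    small.getparent().replace(small, img)

result = html.tostring(root, encoding="unicode")
print(result)
```

The HTML serializer writes the tag as img without the trailing slash, which is equivalent in HTML.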
    
  4. "… I need a regex to find all the occurrences … and transform [them] to an img tag with the src attribute that contains the href url. …"

    A regex pattern by itself doesn’t replace anything; it only matches.
    You’ll need a program or programming language to perform the substitution.

    "… I don’t understand how to NOT include the a tag that do not contains the words Graphic source as text …"

    Assert that the text following the > is "Graphic source<"

    <.+?href\s*=\s*("|')(.+?)(?<!\\)\1.+?>Graphic source<.+>
    

    The substitution text would be,

    <img src="$2"/>
    

    Also, I presume you could use \s* preceding and following the text.

    <.+?href\s*=\s*("|')(.+?)(?<!\\)\1.+?>\s*Graphic source\s*<.+>
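
A whitespace-tolerant, quote-agnostic pattern of this kind could be tried in Python roughly as follows. One deliberate change, which is an assumption and not part of the answer: the greedy <.+> tail is swapped for a literal </a></small>, because a greedy tail would also swallow whatever follows the link on the same line. The sample input is invented to exercise single quotes and extra spaces.

```python
import re

# Group 1 is the opening quote (" or '), group 2 is the URL, and \1 requires
# the same quote to close it. The literal </a></small> tail is an assumed
# replacement for the greedy <.+> used in the answer.
PATTERN = re.compile(
    r'''<.+?href\s*=\s*("|')(.+?)(?<!\\)\1.+?>\s*Graphic source\s*</a></small>'''
)

# Illustrative input: the first link uses single quotes and extra spaces.
sample = (
    "<small><a href = 'https://www.url.com/name1.png' target='_blank'> Graphic source </a></small>test\n"
    '<small><a href="https://www.url.com/name2.jpg" target="_blank" rel="noopener">Graphic source</a></small>'
    'Text text<small><a href="www.url.com">Do not transform</a></small>\n'
)

# The URL is the second capture group, written \2 in Python.
converted = PATTERN.sub(r'<img src="\2"/>', sample)
print(converted)
```

Like the original, this still relies on each link sitting on a single line, since the dot does not match newlines by default.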
    

    "… and how to NOT include all the other attributes of the a tag when transformed to img tag. …"

    In this type of situation, where there are repeated keys and values, you can use the lazy-quantifier, ?, to match up to the first encountered quotation-mark.

    For example,

    ="(.+?)"
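
To see the difference the lazy quantifier makes, here is a small comparison in Python. The tag string is taken from the question; the variable names are only illustrative.

```python
import re

tag = '<a href="https://www.url.com/image.png" target="_blank" rel="noopener">'

# Lazy: (.+?) stops at the first closing quotation mark.
lazy = re.search(r'="(.+?)"', tag).group(1)

# Greedy: (.+) runs on to the last quotation mark in the tag.
greedy = re.search(r'="(.+)"', tag).group(1)

print(lazy)    # https://www.url.com/image.png
print(greedy)  # https://www.url.com/image.png" target="_blank" rel="noopener
```

With the greedy form, the other attributes end up inside the captured URL, which is exactly what the question wants to avoid.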
    

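    Here is an example output. Because the trailing <.+> is greedy, everything after the second link on its line is dropped along with it.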

    Some text
    Other tag <b>test</b>
    <img src="https://www.url.com/name1.png"/>test
    <img src="https://www.url.com/name2.jpg"/>
    