
I’m trying to extract the value of an href attribute from HTML text with grep. Because href appears multiple times in the file, I need to anchor the match on the line above it.

        <tr>
            <th>Description:</th>
            <td class="wrap" itemprop="description">Crypto library written in C++ (legacy version)</td>
        </tr><tr>
            <th>Upstream URL:</th>
            <td><a itemprop="url" href="https://botan.randombit.net/"
                    title="Visit the website for botan2">https://botan.randombit.net/</a></td>
        </tr><tr>
            <th>License(s):</th>
            <td class="wrap">BSD</td>
        </tr>

This is what I’m currently trying:

upstream_url=$(echo "$html_content" | grep -oP '<th>Upstream URL:</th>\s*<td><a .*?href="\K[^"]+')

This should theoretically give the following result: https://botan.randombit.net/
But the result is ''.

I reduced the code to

upstream_url=$(echo "$html_content" | grep -oP '<th>Upstream URL:</th>\s*')

and this gets me <th>Upstream URL:</th>, but as soon as I try to add <td> the result is '' again.

It seems that it has a problem with the line break.

Does someone know what I’m doing wrong here?

Edit: This is the html file I’m using.
https://www.swisstransfer.com/d/79b85ae5-9e63-466f-83df-1ad5c15d11b1 (I renamed it to .txt otherwise I couldn’t have uploaded it)

2 Answers


  1. See if the hack below gets you what you’re after.
    As @shawn notes, using one of the XML toolsets (such as xmllint) is a ‘better’ way to do this, and investing the time and effort to become familiar with it will pay off many times over down the line.

    cat mystik.awk
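    BEGIN{ found=0 }
    # Remember that the "Upstream URL:" header cell has been seen
    /<th>Upstream URL:<\/th>/ { found=1 }
    
    # On the next line containing href=", print whatever follows the last "=" and stop
    found == 1 && /href="/ { idx=split($0, flds, "=" ); print flds[idx] ; exit}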
    
    #
    # test 
    #
    awk -f mystik.awk html_content.txt 
    "https://botan.randombit.net/"
    
    

    As always, test and double-check any code examples offered.

    NB: This is highly unlikely to work on other HTML, so don’t expect it to.
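
The script above prints the URL with its surrounding double quotes. If the bare value is wanted, one possible variation is to split each line on the double quote instead of on "=". This is only a sketch in any POSIX awk: the temporary file stands in for html_content.txt, and it assumes the href attribute sits on a line after the header cell, as in the posted sample.

```shell
# Sample rows copied from the question, written to a temporary file
sample=$(mktemp)
cat > "$sample" <<'EOF'
        <tr>
            <th>Upstream URL:</th>
            <td><a itemprop="url" href="https://botan.randombit.net/"
                    title="Visit the website for botan2">https://botan.randombit.net/</a></td>
        </tr><tr>
            <th>License(s):</th>
            <td class="wrap">BSD</td>
        </tr>
EOF

# Fields are separated by double quotes, so the field that follows
# the one ending in href= is the attribute value without quotes.
upstream_url=$(awk -F'"' '
    /<th>Upstream URL:<\/th>/ { found = 1 }
    found && /href="/ {
        for (i = 1; i < NF; i++)
            if ($i ~ /href=$/) { print $(i + 1); exit }
    }
' "$sample")

echo "$upstream_url"
rm -f "$sample"
```

Unlike splitting on "=", this keeps a URL intact even when it contains "=" in a query string.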

    As a bonus, here are a couple of examples using xmllint (which I force myself to use occasionally when awking would be too laborious) 🙂

    xmllint --html --xpath '//tr[th[text()="Upstream URL:"]]/td/a[@itemprop="url"]/@href' html_content.txt
     href="https://botan.randombit.net/"
    
    #
    # more 'concise' - for this search specifically !!
    #
    xmllint --html --xpath '//td/a[@itemprop="url"]/@href' html_content.txt
     href="https://botan.randombit.net/"
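
Both xmllint commands above print the attribute node, so the output still carries the href="..." wrapper. Wrapping the expression in the XPath string() function should return only the value. A minimal sketch, assuming xmllint is installed; the temporary file and the surrounding <table> wrapper are stand-ins for the real page, not part of the original file.

```shell
# Sample rows from the question, wrapped in a <table> so the fragment parses as HTML
sample=$(mktemp)
cat > "$sample" <<'EOF'
<html><body><table>
<tr>
<th>Upstream URL:</th>
<td><a itemprop="url" href="https://botan.randombit.net/"
       title="Visit the website for botan2">https://botan.randombit.net/</a></td>
</tr>
</table></body></html>
EOF

# string() converts the selected attribute node to its text value.
# The guard skips the call on systems without xmllint.
if command -v xmllint >/dev/null 2>&1; then
    upstream_url=$(xmllint --html \
        --xpath 'string(//tr[th[text()="Upstream URL:"]]/td/a[@itemprop="url"]/@href)' \
        "$sample" 2>/dev/null)
    echo "$upstream_url"
fi
rm -f "$sample"
```

Parser warnings about the HTML go to stderr, which is why the sketch discards it.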
    
    

    Hope this helps.

  2. Using GNU awk for multi-char RS, the 3rd arg to match(), and the \s shorthand, with the regexp provided in the question:

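    $ awk -v RS='^$' '
        # RS="^$" makes the whole file one record, so \s* can span the line break.
        # [^>]* keeps the match inside this <a ...> tag instead of running to the last href in the file.
        match($0, /<th>Upstream URL:<\/th>\s*<td><a [^>]*href="([^"]+)"/, a) {
            print a[1]
            exit
        }
    ' file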
    https://botan.randombit.net/
    

    That reads the whole input into memory. If your input is too big for that, and it looks like the sample you posted, you could instead read one </td>- or </th>-terminated record at a time, still with GNU awk:

    $ awk -v RS='</t[dh]>' '
        (prev ~ /<th>Upstream URL:$/) && match($0, /<td><a .*href="([^"]+)"/, a) {
            print a[1]
            exit
        }
        { prev = $0 }
    ' file
    https://botan.randombit.net/
    

    Alternatively, using any awk and only reading 1 line at a time:

    $ awk '
        /<th>.*<\/th>/ { f = (/Upstream URL:/) }
        f && sub(/.*<td><a .*href="/,"") && sub(/".*/,"")  {
            print
            exit
        }
    ' file
    https://botan.randombit.net/
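
As for why the grep attempt in the question comes back empty: grep matches one line at a time, so a pattern can never span the line break between </th> and <td>. A minimal sketch of a workaround, assuming GNU grep built with PCRE support (-P): the -z option makes grep use NUL instead of newline as the record separator, so input without NUL bytes is treated as a single record. Note that [^>]*? replaces the question's .*? here so the match stays inside the <a> tag; that substitution is an assumption about what is wanted, not part of the original pattern.

```shell
# Sample rows copied from the question
html_content='        <tr>
            <th>Upstream URL:</th>
            <td><a itemprop="url" href="https://botan.randombit.net/"
                    title="Visit the website for botan2">https://botan.randombit.net/</a></td>
        </tr>'

# -z: whole input is one record, so \s* can cross the newline.
# With -z the match is printed NUL-terminated, hence the tr to strip it.
upstream_url=$(printf '%s\n' "$html_content" |
    grep -zoP '<th>Upstream URL:</th>\s*<td><a [^>]*?href="\K[^"]+' |
    tr -d '\0')

echo "$upstream_url"
```

If the page has several matching rows, the matches come out concatenated once the NULs are removed, so tr '\0' '\n' may be the better choice there.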
    