Html - how to grep a specific passage in a line in accordance to another line

MystikReasons
June 16, 2024
173 views
0 votes
2 Answers

I’m currently trying to grep the content of href from an HTML text. The problem is that href appears multiple times in the file, so I need to anchor the match on a line above it.

        <tr>
            <th>Description:</th>
            <td class="wrap" itemprop="description">Crypto library written in C++ (legacy version)</td>
        </tr><tr>
            <th>Upstream URL:</th>
            <td><a itemprop="url" href="https://botan.randombit.net/"
                    title="Visit the website for botan2">https://botan.randombit.net/</a></td>
        </tr><tr>
            <th>License(s):</th>
            <td class="wrap">BSD</td>
        </tr>

What I’m currently trying to do is to grep

upstream_url=$(echo "$html_content" | grep -oP '<th>Upstream URL:</th>\s*<td><a .*?href="\K[^"]+')

This should theoretically get the following result: https://botan.randombit.net/
But the result is an empty string.

I reduced the code to

upstream_url=$(echo "$html_content" | grep -oP '<th>Upstream URL:</th>\s*')

and this gives me <th>Upstream URL:</th>, but as soon as I add <td> the result is empty again.

It seems that it has a problem with the line break.

Does someone know what I’m doing wrong here?

Edit: This is the html file I’m using.
https://www.swisstransfer.com/d/79b85ae5-9e63-466f-83df-1ad5c15d11b1 (I renamed it to .txt otherwise I couldn’t have uploaded it)

Answers

- ticktalk
- June 14, 2024 at 3:02 am
- 0 votes
See if the below hack gets you to what you’re after.
As per @shawn, using one of the XML toolsets (such as xmllint) is a ‘better’ way to do this, and investing your time and effort to gain some familiarity with it will pay back manifold down the line.
```
cat mystik.awk
BEGIN{ found=0 }
/<th>Upstream URL:<\/th>/ { found=1 }

found == 1 && /href="/ { idx=split($0, flds, "=" ); print flds[idx] ; exit}

#
# test 
#
awk -f mystik.awk html_content.txt 
"https://botan.randombit.net/"
```
As always, test, check and double-check any code examples offered.

NB: would this work on other html – highly unlikely, so don’t expect it to.
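
The quotes in that output come from splitting the line on `=`, which would also truncate a URL that itself contains `=`. Below is a minimal sketch of one way around both, assuming the same layout as the sample in the question and using only POSIX awk features (`match()`, `RSTART`, `RLENGTH`). The file name `sample.html` and the variable `upstream_url` are placeholders, not part of the original script.

```shell
# Recreate the fragment from the question (sample.html is a placeholder name).
cat > sample.html <<'EOF'
        <tr>
            <th>Description:</th>
            <td class="wrap" itemprop="description">Crypto library written in C++ (legacy version)</td>
        </tr><tr>
            <th>Upstream URL:</th>
            <td><a itemprop="url" href="https://botan.randombit.net/"
                    title="Visit the website for botan2">https://botan.randombit.net/</a></td>
        </tr><tr>
            <th>License(s):</th>
            <td class="wrap">BSD</td>
        </tr>
EOF

upstream_url=$(awk '
    # Remember that the "Upstream URL:" header row has been seen.
    /<th>Upstream URL:<\/th>/ { found = 1 }
    # On the first later line containing href="...", print only the value.
    found && match($0, /href="[^"]*"/) {
        # Skip the 6 characters of href=" and drop the closing quote.
        print substr($0, RSTART + 6, RLENGTH - 7)
        exit
    }
' sample.html)

echo "$upstream_url"
```

Like the script above, this depends on the href following the header row in the file, so it is just as fragile on HTML laid out differently.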

As a bonus, here are a couple of examples using xmllint (which I force myself to use occasionally when awking would be too laborious) 🙂
```
xmllint --html --xpath '//tr[th[text()="Upstream URL:"]]/td/a[@itemprop="url"]/@href' html_content.txt
 href="https://botan.randombit.net/"

#
# more 'concise' - for this search specifically !!
#
xmllint --html --xpath '//td/a[@itemprop="url"]/@href' html_content.txt
 href="https://botan.randombit.net/"
```
hope this helps

- EdMorton
- June 16, 2024 at 1:07 pm
- 0 votes
Using GNU awk for multi-char RS, the 3rd arg to match(), and the \s shorthand, with the regexp provided in the question:
```
$ awk -v RS='^$' '
    match($0, /<th>Upstream URL:<\/th>\s*<td><a .*href="([^"]+)"/, a) {
        print a[1]
        exit
    }
' file
https://botan.randombit.net/
```
That will read the whole of the input into memory. If your input is too big for that then, given input that looks like your posted sample input, you could do this instead to just read one </td> or </th> separated record at a time, still with GNU awk:
```
$ awk -v RS='</t[dh]>' '
    (prev ~ /<th>Upstream URL:$/) && match($0, /<td><a .*href="([^"]+)"/, a) {
        print a[1]
        exit
    }
    { prev = $0 }
' file
https://botan.randombit.net/
```
Alternatively, using any awk and only reading 1 line at a time:
```
$ awk '
    /<th>.*<\/th>/ { f = (/Upstream URL:/) }
    f && sub(/.*<td><a .*href="/,"") && sub(/".*/,"")  {
        print
        exit
    }
' file
https://botan.randombit.net/
```
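
As for why the grep in the question comes back empty: grep matches one line at a time, so `\s*` never gets to see the newline between `</th>` and `<td>`. Here is a sketch of one possible workaround, assuming GNU grep built with PCRE support. The `-z` option makes grep treat the input as NUL-separated records, so a text file with no NUL bytes becomes a single record (and, like `RS='^$'`, is read into memory whole). `sample.html` is a placeholder file name.

```shell
# Recreate the relevant rows from the question (sample.html is a placeholder name).
cat > sample.html <<'EOF'
        <tr>
            <th>Upstream URL:</th>
            <td><a itemprop="url" href="https://botan.randombit.net/"
                    title="Visit the website for botan2">https://botan.randombit.net/</a></td>
        </tr>
EOF

# -z: the whole file is one record, so \s* can match the line break.
# [^>]*? stays inside the <a ...> tag; \K drops everything matched so far,
# leaving only the href value in the output.
# With -z, grep ends each match with a NUL byte, which tr removes.
upstream_url=$(grep -zoP '<th>Upstream URL:</th>\s*<td><a [^>]*?href="\K[^"]+' sample.html | tr -d '\0')

echo "$upstream_url"
```

The `[^>]*?` in place of the question's `.*?` is a deliberate choice: in PCRE `.` does not match a newline by default, while a negated character class does, so the pattern still works if the attributes are wrapped across lines.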