I’m currently trying to grep the content of href from an HTML text. The problem is that href is listed multiple times in the file, and therefore I need to grep a line above it.
<tr>
<th>Description:</th>
<td class="wrap" itemprop="description">Crypto library written in C++ (legacy version)</td>
</tr><tr>
<th>Upstream URL:</th>
<td><a itemprop="url" href="https://botan.randombit.net/"
title="Visit the website for botan2">https://botan.randombit.net/</a></td>
</tr><tr>
<th>License(s):</th>
<td class="wrap">BSD</td>
</tr>
What I’m currently trying to do is to grep
upstream_url=$(echo "$html_content" | grep -oP '<th>Upstream URL:</th>\s*<td><a .*?href="\K[^"]+')
This should theoretically get the following result: https://botan.randombit.net/
But the result is an empty string ("").
I reduced the code to
upstream_url=$(echo "$html_content" | grep -oP '<th>Upstream URL:</th>\s*')
and this will get me <th>Upstream URL:</th>, but as soon as I try to add <td> the result is empty again.
It seems that it has a problem with the line break.
Does someone know what I’m doing wrong here?
Edit: This is the html file I’m using.
https://www.swisstransfer.com/d/79b85ae5-9e63-466f-83df-1ad5c15d11b1 (I renamed it to .txt otherwise I couldn’t have uploaded it)
2 Answers
See if the below hack gets you to what you’re after.
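One way such a hack could look, sketched under the assumption that GNU grep with PCRE support (-P) is available: grep matches one line at a time, so a pattern that has to cross the line break between </th> and <td> can never match. The -z option makes grep treat the whole input as a single NUL-terminated record, which lets \s* span the newline. The html_content value below is a shortened copy of the sample from the question.

```shell
# Shortened copy of the sample from the question.
html_content='<tr>
<th>Upstream URL:</th>
<td><a itemprop="url" href="https://botan.randombit.net/"
title="Visit the website for botan2">https://botan.randombit.net/</a></td>
</tr>'

# -z: treat the whole input as one record, so \s* can match the newline
# -o: print only the matched part; \K discards everything matched before it
# tr removes the NUL byte that grep -z appends to each match
upstream_url=$(printf '%s\n' "$html_content" |
    grep -zoP '<th>Upstream URL:</th>\s*<td><a [^>]*href="\K[^"]+' |
    tr -d '\0')

printf '%s\n' "$upstream_url"
```

With the sample above this should print https://botan.randombit.net/ (including the trailing slash, since that is what the href attribute contains).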
As per @shawn, using one of the XML toolsets (such as xmllint) is a ‘better’ way to do this, and investing your time and effort to gain some familiarity with it will pay back manifold down the line.
As always, test/check/double-check any/all code examples offered.
NB: Would this work on other HTML? Highly unlikely, so don’t expect it to.
As a bonus, here are a couple of examples using xmllint (which I force myself to use occasionally when awking would be too laborious) 🙂
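A minimal sketch of what an xmllint-based lookup might look like, assuming xmllint (from libxml2) is installed and that the rows sit inside a <table> element, which is added here only so the sample parses as a complete table. The XPath expression is an illustration, not necessarily the one originally intended.

```shell
# Sample rows from the question, wrapped in <table> for the HTML parser.
html_content='<table><tr>
<th>Upstream URL:</th>
<td><a itemprop="url" href="https://botan.randombit.net/"
title="Visit the website for botan2">https://botan.randombit.net/</a></td>
</tr><tr>
<th>License(s):</th>
<td class="wrap">BSD</td>
</tr></table>'

# --html:  use the lenient HTML parser instead of the strict XML one
# --xpath: evaluate the expression; string() returns the attribute value only
# -:       read the document from standard input
upstream_url=$(printf '%s\n' "$html_content" |
    xmllint --html \
        --xpath 'string(//th[normalize-space()="Upstream URL:"]/following-sibling::td[1]/a/@href)' \
        - 2>/dev/null)

printf '%s\n' "$upstream_url"
```

Because the expression selects the <td> that follows the "Upstream URL:" header cell, it does not depend on line breaks or on how many other href attributes the page contains.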
Hope this helps.
Using GNU awk for multi-char RS, the 3rd arg to match(), and the \s shorthand with the regexp provided in the question:
That will read the whole of the input into memory. If your input is too big for that then, given input that looks like your posted sample input, you could do this instead to just read one </td> or </th> separated record at a time, still with GNU awk:
Alternatively, using any awk and only reading 1 line at a time:
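A sketch of how the line-at-a-time variant could be written with POSIX awk features only. It assumes, as in the posted sample, that the href attribute appears on a line after the "Upstream URL:" header; the flag name found is purely illustrative.

```shell
# Shortened copy of the sample from the question.
html_content='<tr>
<th>Description:</th>
<td class="wrap" itemprop="description">Crypto library written in C++ (legacy version)</td>
</tr><tr>
<th>Upstream URL:</th>
<td><a itemprop="url" href="https://botan.randombit.net/"
title="Visit the website for botan2">https://botan.randombit.net/</a></td>
</tr>'

upstream_url=$(printf '%s\n' "$html_content" | awk '
    # Remember that the header cell has been seen.
    /<th>Upstream URL:<\/th>/ { found = 1 }

    # On the first later line that has an href attribute, print its value.
    found && match($0, /href="[^"]+/) {
        # RSTART points at the h of href; skip the 6 characters of href="
        print substr($0, RSTART + 6, RLENGTH - 6)
        exit
    }
')

printf '%s\n' "$upstream_url"
```

The exit after the first match keeps later href attributes in the file from being printed, which is the part the plain grep -o approach could not express.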