How to extract surrounding html with bash?

membersound
June 26, 2024
210 views
0 votes
2 Answers

I would like to read the href URL from the row that contains the <td>blue</td> element:

```
<html>
<body>
<table>
    <tr>
        <td>
            <a href="localhost/url1">url1</a>
        </td>
        <td>blue</td>
    </tr>
    <tr>
        <td>
            <a href="localhost/url2">url2</a>
        </td>
        <td>green</td>
    </tr>
</table>
</body>
</html>
```

I first tried to capture the surrounding <tr> tag, but even this does not work:

```
#!/bin/bash
HTML_FILE="html_content.html"
tr_tag=$(grep -o '<tr>.*blue.*</tr>' "$HTML_FILE")
echo $tr_tag
```

My output is always blank. Why?

Tags: bash html

Answers

- Philippe
- June 26, 2024 at 11:21 am
- 0 votes
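If you are using GNU grep built with PCRE support (the -P option), try this. The -z option makes grep treat the whole file as a single record, and (?s) lets . match newlines, so the pattern can span lines: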
```
grep -z -Po '(?s)<tr>.*?blue.*?</tr>' "$HTML_FILE"
```

- EdMorton
- June 26, 2024 at 1:03 pm
- 0 votes
Using any awk in any shell on every Unix box and only reading 1 line at a time into memory, assuming your input is always formatted exactly as shown in your question:
```
$ awk -F'"' '/<a href="/{url=$2} /<td>blue</td>/{print url}' file
localhost/url1
```
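The same idea can be wrapped in a small function so the color is not hard-coded. This is only a sketch under the same assumption that the input is formatted exactly as in the question; the function name find_url, the variable names, and the temporary file are hypothetical and not part of the original answer:

```bash
#!/bin/bash
# Hypothetical sample file reproducing the table from the question.
html_file=$(mktemp)
cat > "$html_file" <<'EOF'
<html>
<body>
<table>
    <tr>
        <td>
            <a href="localhost/url1">url1</a>
        </td>
        <td>blue</td>
    </tr>
    <tr>
        <td>
            <a href="localhost/url2">url2</a>
        </td>
        <td>green</td>
    </tr>
</table>
</body>
</html>
EOF

# find_url COLOR FILE: print the href seen most recently before <td>COLOR</td>.
find_url() {
    awk -F'"' -v color="$1" '
        /<a href="/ { url = $2 }                        # remember the latest href
        index($0, "<td>" color "</td>") { print url }   # literal match, not a regex
    ' "$2"
}

blue_url=$(find_url blue "$html_file")
green_url=$(find_url green "$html_file")
echo "$blue_url"    # localhost/url1
echo "$green_url"   # localhost/url2

rm -f "$html_file"
```

Using index() instead of a dynamic regex keeps the color a literal string, so characters that are special in regexes would not change the match.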
Regarding:

even this does not work:
…
grep -o '<tr>.*blue.*</tr>' "$HTML_FILE"

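grep matches one line at a time, and in your input <tr>, blue, and </tr> are on separate lines: the line containing blue holds only <td>blue</td>. No single line matches the regex, so the output is empty.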
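A quick way to see how grep's line-by-line matching interacts with this input is to count matching lines before and after joining the file into a single line. This is an illustrative check only, using a hypothetical temporary file and variable names; the joined form is not a good extraction method, because the greedy .* would then span from the first <tr> to the last </tr>:

```bash
#!/bin/bash
# Hypothetical sample file reproducing the table from the question.
html_file=$(mktemp)
cat > "$html_file" <<'EOF'
<html>
<body>
<table>
    <tr>
        <td>
            <a href="localhost/url1">url1</a>
        </td>
        <td>blue</td>
    </tr>
    <tr>
        <td>
            <a href="localhost/url2">url2</a>
        </td>
        <td>green</td>
    </tr>
</table>
</body>
</html>
EOF

# grep tests each line separately; count the lines matching the original regex.
# "|| true" swallows grep's exit status 1 when nothing matches.
per_line=$(grep -c '<tr>.*blue.*</tr>' "$html_file" || true)

# Join all lines into one (plus a final newline), then count again.
joined=$({ tr -d '\n' < "$html_file"; echo; } | grep -c '<tr>.*blue.*</tr>' || true)

echo "matching lines in original file: $per_line"   # 0
echo "matching lines after joining:    $joined"     # 1

rm -f "$html_file"
```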