
I would like to read the href URL that is stored in the same table row as the <td>blue</td> element:

<html>
<body>
<table>
    <tr>
        <td>
            <a href="localhost/url1">url1</a>
        </td>
        <td>blue</td>
    </tr>
    <tr>
        <td>
            <a href="localhost/url2">url2</a>
        </td>
        <td>green</td>
    </tr>
</table>
</body>
</html>

I first tried to capture the surrounding <tr> tag, but even this does not work:

#!/bin/bash
HTML_FILE="html_content.html"
tr_tag=$(grep -o '<tr>.*blue.*</tr>' "$HTML_FILE")
echo $tr_tag

My output is always blank. Why?

2 Answers


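  1. If you are using GNU grep built with PCRE support, try the command below. The -z option makes grep treat the whole file as a single record instead of matching line by line, and (?s) lets . match newlines. Note that the match starts at the first <tr> in the file, so this works for the blue row here but would span several rows for a color further down: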

    grep -z -Po '(?s)<tr>.*?blue.*?</tr>' "$HTML_FILE"
    
  2. Using any awk in any shell on every Unix box and only reading 1 line at a time into memory, assuming your input is always formatted exactly as shown in your question:

    $ awk -F'"' '/<a href="/{url=$2} /<td>blue</td>/{print url}' file
    localhost/url1
    

    Regarding:

    even this does not work:

    grep -o '<tr>.*blue.*</tr>' "$HTML_FILE"

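    grep matches one line at a time, and in your file <tr>, blue and </tr> are all on different lines, so no single line can match that pattern and the output is empty. Also note that blue is directly surrounded by <td>, not <tr>, tags.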

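To check both points from the answers in one place, here is a minimal sketch. It recreates the table from the question in a temporary file, runs the original grep to show that it matches nothing, and wraps the awk approach in a small function. The function name url_for_color and the variable names are made up for illustration, and the sketch assumes the input is always laid out exactly as in the question (one tag per line, the link before the color cell).

```shell
#!/bin/bash
# Recreate the sample table from the question in a temporary file.
html_file=$(mktemp)
cat > "$html_file" <<'EOF'
<html>
<body>
<table>
    <tr>
        <td>
            <a href="localhost/url1">url1</a>
        </td>
        <td>blue</td>
    </tr>
    <tr>
        <td>
            <a href="localhost/url2">url2</a>
        </td>
        <td>green</td>
    </tr>
</table>
</body>
</html>
EOF

# grep tests one line at a time. No single line contains <tr>, blue and
# </tr> together, so the pattern from the question matches nothing.
# "|| true" only hides grep's non-zero exit status for "no match".
single_line_match=$(grep -o '<tr>.*blue.*</tr>' "$html_file" || true)

# Hypothetical helper: remember the most recent href value and print it
# once the <td> cell holding the requested color is reached.
#   $1 = color to look for, $2 = HTML file
url_for_color() {
    awk -F'"' -v color="$1" '
        /<a href="/                     { url = $2 }
        index($0, "<td>" color "</td>") { print url }
    ' "$2"
}

blue_url=$(url_for_color blue "$html_file")
green_url=$(url_for_color green "$html_file")
printf 'blue:  %s\ngreen: %s\n' "$blue_url" "$green_url"

rm -f "$html_file"
```

With the sample input, single_line_match stays empty while the function returns the URL from the matching row. For HTML that is not formatted this predictably, a real HTML parser would be a safer choice than line-based tools.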