
Here is a test file that contains links and their names inside <a></a> tags.

/tmp/test_html.txt

<tr>
<td><a href="http://www.example.com/link1">example link 1</a></td>
</tr>
<tr>
<td><a href="http://www.example.com/link2">example link 2</a></td>
</tr>
<tr>
<td><a href="http://www.example.com/link3">example link 3</a></td>
</tr>
<tr>
<td><a href="https://www.example.com/4/0/1/40116601-1FDC-real-world-link/bar" target="_blank" class="real-world-class">Real World Link</a>&nbsp;</td>
</tr>

The following command (see "How to strip out all of the links of an HTML file in Bash or grep or batch and store them in a text file") finds all the links in the file, but it can't print each link together with its name:

# sed -n 's/.*href="\([^"]*\).*/\1/p' /tmp/test_html.txt

I want a command that prints each link on its own line, with the name first, followed by the href.

Here is the expected output:

# sed <...command....> /tmp/test_html.txt

example link 1 | http://www.example.com/link1
example link 2 | http://www.example.com/link2
example link 3 | http://www.example.com/link3
Real World Link | https://www.example.com/4/0/1/40116601-1FDC-real-world-link/bar

How should I write the sed command?

2 Answers


  1. This solution seems to work; please mark as correct or post a comment to explain why it is not correct; thanks!

    cat input3 | sed -n 's/^.*<a href="\(.*\)">example link\( [0-9][0-9]*\)<\/a><\/td>$/example link\2 | \1/p'
    
  2. This might work for you (GNU sed):

    sed -En 's/.*href="([^"]*)"[^>]*>([^<]*)<.*/\2 | \1/p' file
    

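    The -n option suppresses automatic printing, so only lines where the substitution succeeds are printed (via the p flag). The -E option enables extended regular expressions, so the capture groups need no backslashes.

    The pattern matches lines containing href followed by the link's inner text, then formats the output as required using back references: the first group captures the href value and the second captures the text.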
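As a rough check of the second answer, the sketch below writes a shortened copy of the question's test data to a temporary file and runs the command against it. The `input` and `output` variable names and the use of `mktemp` are assumptions made for illustration; they are not part of either answer.

```shell
# Hypothetical temporary file holding a shortened copy of the sample HTML.
input=$(mktemp)
cat > "$input" <<'EOF'
<tr>
<td><a href="http://www.example.com/link1">example link 1</a></td>
</tr>
<tr>
<td><a href="https://www.example.com/4/0/1/40116601-1FDC-real-world-link/bar" target="_blank" class="real-world-class">Real World Link</a>&nbsp;</td>
</tr>
EOF

# -n: print nothing by default; -E: extended regular expressions.
# Group 1 is the href value, group 2 is the text between <a ...> and </a>.
# [^>]* skips any extra attributes such as target= or class=.
output=$(sed -En 's/.*href="([^"]*)"[^>]*>([^<]*)<.*/\2 | \1/p' "$input")
printf '%s\n' "$output"

# Clean up the temporary file.
rm -f "$input"
```

Note that this approach assumes one link per line with the href attribute appearing first inside the tag; lines with several links, or links split across lines, would likely need a real HTML parser instead.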
