How to find out all <a> tag links and names from html file - Ubuntu

stackbiz
January 31, 2023

Here is a test file that contains links and names within <a></a> tags.

/tmp/test_html.txt

<tr>
<td><a href="http://www.example.com/link1">example link 1</a></td>
</tr>
<tr>
<td><a href="http://www.example.com/link2">example link 2</a></td>
</tr>
<tr>
<td><a href="http://www.example.com/link3">example link 3</a></td>
</tr>
<tr>
<td><a href="https://www.example.com/4/0/1/40116601-1FDC-real-world-link/bar" target="_blank" class="real-world-class">Real World Link</a>&nbsp;</td>
</tr>

The following command, taken from the question "How to strip out all of the links of an HTML file in Bash or grep or batch and store them in a text file", finds all links in the file, but it can't print the link and the name together:

# sed -n 's/.*href="\([^"]*\).*/\1/p' /tmp/test_html.txt

I want the command to print one link per line, with the name first, followed by the href.

Here is the expected output:

# sed <...command....> /tmp/test_html.txt

example link 1 | http://www.example.com/link1
example link 2 | http://www.example.com/link2
example link 3 | http://www.example.com/link3
Real World Link | https://www.example.com/4/0/1/40116601-1FDC-real-world-link/bar

How should I write the sed command?

Answers

- Andrew
- January 31, 2023 at 12:41 pm
- 0 votes
This solution seems to work; please mark as correct or post a comment to explain why it is not correct; thanks!
```
cat input3 | sed -n 's/^.*<a href="\(.*\)">example link\( [0-9][0-9]*\)<\/a><\/td>$/example link\2 | \1/p'
```

- potong
- January 31, 2023 at 1:14 pm
- 0 votes
0
This might work for you (GNU sed):
```
sed -En 's/.*href="([^"]*)"[^>]*>([^<]*)<.*/\2 | \1/p' file
```
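The -n option suppresses automatic printing, so only lines where the substitution succeeds are printed (by the p flag); the -E option enables extended regular expressions, so the capture groups need no backslashes.

The pattern matches lines containing href, captures the URL (first group) and the inner text of the tag (second group), and the replacement prints them in the required order using the back references \2 and \1.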
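For anyone who wants to try this end to end, here is a minimal sketch that rebuilds a shortened copy of the question's test file and runs the command above on it. The function name `extract_links` and the temporary file are my own additions for illustration, not part of the original answer, and it assumes a sed that supports -E (GNU sed does).

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption, not from the answer):
# print "name | href" for every line that contains an <a href="..."> tag.
extract_links() {
    sed -En 's/.*href="([^"]*)"[^>]*>([^<]*)<.*/\2 | \1/p' "$1"
}

# Recreate a shortened version of the question's test file in a temp location.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<tr>
<td><a href="http://www.example.com/link1">example link 1</a></td>
</tr>
<tr>
<td><a href="https://www.example.com/4/0/1/40116601-1FDC-real-world-link/bar" target="_blank" class="real-world-class">Real World Link</a>&nbsp;</td>
</tr>
EOF

# Lines without an href (the <tr> and </tr> rows) produce no output.
extract_links "$sample"
```

Note that this approach is line-based: it expects one <a> tag per line, as in the test file. If a line held several links, the greedy leading `.*` would keep only the last one, so for arbitrary HTML a real parser is probably the safer choice.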
