
I’m trying to list all URLs from a web page, but only the ones in a specific HTML <a> tag.

For example, the target URLs are the ones in <a> tags containing "Info science":

bunch of html before

<a rel="external nofollow" title="something" href="https://url1&amp;rl=h1" target="_blank">
<strong class="doc" style="color:#000000">Info</strong> science</a>

<a rel="external nofollow" title="something" href="https://url2&amp;rl=h1" target="_blank">
<strong class="image" style="color:#000000">Info</strong> bio</a>

<a rel="external nofollow" title="something" href="https://url3&amp;rl=h1" target="_blank">
<strong class="doc" style="color:#000000">Info</strong> science</a>

<a rel="external nofollow" title="something" href="https://url4&amp;rl=h1" target="_blank">
<strong class="image" style="color:#000000">Info</strong> bio</a>

bunch of html after

The end result should be:
list_links.txt

https://url1&rl=h1
https://url3&rl=h1

I tried, in a Linux terminal:
lynx -dump https://website_to_scrape | awk '/https:/{print $2}' > list_links.txt

but of course I get all the URLs on the web page. Which is already a step forward.

I also tried:
grep -r science source.html > list_links.txt
but I get a list of all the HTML strong tags containing the word "science".

My knowledge is limited, but it would be great to have a solution as a bash script or something I could run from the terminal.
I’m open to other solutions, of course.

3 Answers


  1. In Python…

    Given the HTML data as a string then:

    from bs4 import BeautifulSoup as BS
    
    html = """
    <a rel="external nofollow" title="something" href="https://url1&amp;rl=h1" target="_blank">
    <strong class="doc" style="color:#000000">Info</strong> science</a>
    
    <a rel="external nofollow" title="something" href="https://url2&amp;rl=h1" target="_blank">
    <strong class="image" style="color:#000000">Info</strong> bio</a>
    
    <a rel="external nofollow" title="something" href="https://url3&amp;rl=h1" target="_blank">
    <strong class="doc" style="color:#000000">Info</strong> science</a>
    
    <a rel="external nofollow" title="something" href="https://url4&amp;rl=h1" target="_blank">
    <strong class="image" style="color:#000000">Info</strong> bio</a>
    """
    soup = BS(html, 'lxml')   # 'html.parser' also works if lxml isn't installed
    
    # keep only <a> tags that have an href and whose text is "Info science"
    for a in soup.find_all('a', href=True):
        if a.getText().strip() == 'Info science':
            print(a['href'])
    

    Output:

    https://url1&rl=h1
    https://url3&rl=h1
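    If BeautifulSoup isn’t available, the same filtering can be done with only the standard library’s html.parser. A sketch, using a shortened stand-in for the page source (HTMLParser decodes the &amp; entity in attribute values automatically):

    ```python
    from html.parser import HTMLParser

    # Shortened stand-in for the page source from the question.
    html = '''
    <a rel="external nofollow" href="https://url1&amp;rl=h1"><strong>Info</strong> science</a>
    <a rel="external nofollow" href="https://url2&amp;rl=h1"><strong>Info</strong> bio</a>
    <a rel="external nofollow" href="https://url3&amp;rl=h1"><strong>Info</strong> science</a>
    '''

    class InfoScienceLinks(HTMLParser):
        """Collect href values of <a> tags whose text is exactly 'Info science'."""
        def __init__(self):
            super().__init__()
            self.links = []
            self._href = None   # href of the currently open <a>, if any
            self._text = []     # text fragments seen inside it

        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                self._href = dict(attrs).get('href')  # entities already decoded here
                self._text = []

        def handle_data(self, data):
            if self._href is not None:
                self._text.append(data)

        def handle_endtag(self, tag):
            if tag == 'a' and self._href is not None:
                # normalize whitespace before comparing the link text
                if ' '.join(''.join(self._text).split()) == 'Info science':
                    self.links.append(self._href)
                self._href = None

    parser = InfoScienceLinks()
    parser.feed(html)
    print('\n'.join(parser.links))
    ```

    This prints https://url1&rl=h1 and https://url3&rl=h1, one per line.
    
    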
    
  2. Using xidel and XPath on the command line in a shell (no awk or regex needed):

    xidel -e '//a[contains(., "Info science")]/@href' -s <FILE or URL>
    

    Output:

    https://url1&rl=h1
    https://url3&rl=h1
    

    Works on Linux, macOS, Windows, BSD…

    Check https://github.com/benibela/xidel

  3. Use the xmllint tool with a specific XPath expression:

    xmllint --html --xpath "//a[contains(., 'Info science')]/@href" source.html
    

     href="https://url1&amp;rl=h1"
     href="https://url3&amp;rl=h1"
    

    To extract the raw URLs, pipe through a sed substitution:

    xmllint --html --xpath "//a[contains(., 'Info science')]/@href" source.html \
    | sed 's/^ *href="//; s/"$//'
    

    https://url1&amp;rl=h1
    https://url3&amp;rl=h1

    Note that xmllint keeps the ampersand escaped as &amp; when it serializes the attribute.
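    The sed cleanup can also decode the &amp; entity in the same pass. A sketch on the sample output, using printf as a stand-in for the xmllint call (POSIX sed only; \& in the replacement is a literal ampersand):

    ```shell
    # Stand-in for the xmllint output: one ` href="..."` line per match.
    printf ' href="https://url1&amp;rl=h1"\n href="https://url3&amp;rl=h1"\n' \
    | sed 's/^ *href="//; s/"$//; s/&amp;/\&/g'
    ```

    This prints https://url1&rl=h1 and https://url3&rl=h1, one per line.
    
    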
    