
I’m trying to list all URLs from a web page, but only the ones in a specific HTML <a> tag.

For example, the target URLs are the ones in <a> tags containing "Info science":

bunch of html before

<a rel="external nofollow" title="something" href="https://url1&amp;rl=h1" target="_blank">
<strong class="doc" style="color:#000000">Info</strong> science</a>

<a rel="external nofollow" title="something" href="https://url2&amp;rl=h1" target="_blank">
<strong class="image" style="color:#000000">Info</strong> bio</a>

<a rel="external nofollow" title="something" href="https://url3&amp;rl=h1" target="_blank">
<strong class="doc" style="color:#000000">Info</strong> science</a>

<a rel="external nofollow" title="something" href="https://url4&amp;rl=h1" target="_blank">
<strong class="image" style="color:#000000">Info</strong> bio</a>

bunch of html after

The end result should be:
list_links.txt

https://url1&rl=h1
https://url3&rl=h1

I tried, in a Linux terminal:
lynx -dump https://website_to_scrape | awk '/https:/{print $2}' > list_links.txt

but of course I get all the URLs on the web page. Which is already a step forward.

I also tried:
grep -r science source.html > list_links.txt
but I get a list of all the HTML strong tags containing the word "science".

My knowledge is limited, but it would be great to have a solution as a bash script or something I could run from the terminal.
I’m open to other solutions, of course.

3 Answers


  1. In Python…

    Given the HTML data as a string then:

    from bs4 import BeautifulSoup as BS
    
    html = """
    <a rel="external nofollow" title="something" href="https://url1&amp;rl=h1" target="_blank">
    <strong class="doc" style="color:#000000">Info</strong> science</a>
    
    <a rel="external nofollow" title="something" href="https://url2&amp;rl=h1" target="_blank">
    <strong class="image" style="color:#000000">Info</strong> bio</a>
    
    <a rel="external nofollow" title="something" href="https://url3&amp;rl=h1" target="_blank">
    <strong class="doc" style="color:#000000">Info</strong> science</a>
    
    <a rel="external nofollow" title="something" href="https://url4&amp;rl=h1" target="_blank">
    <strong class="image" style="color:#000000">Info</strong> bio</a>
    """
    soup = BS(html, 'lxml')   # 'html.parser' also works if lxml isn't installed
    
    # keep only <a> tags that have an href and whose text is "Info science"
    for a in soup.find_all('a', href=True):
        if a.getText().strip() == 'Info science':
            print(a['href'])
    

    Output:

    https://url1&rl=h1
    https://url3&rl=h1
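    If BeautifulSoup isn’t available, the same filtering can be done with only the standard library’s html.parser. A sketch, using a shortened stand-in for the page source (HTMLParser decodes the &amp; entity in attribute values automatically):

    ```python
    from html.parser import HTMLParser

    # Shortened stand-in for the page source from the question.
    html = '''
    <a rel="external nofollow" href="https://url1&amp;rl=h1"><strong>Info</strong> science</a>
    <a rel="external nofollow" href="https://url2&amp;rl=h1"><strong>Info</strong> bio</a>
    <a rel="external nofollow" href="https://url3&amp;rl=h1"><strong>Info</strong> science</a>
    '''

    class InfoScienceLinks(HTMLParser):
        """Collect href values of <a> tags whose text is exactly 'Info science'."""
        def __init__(self):
            super().__init__()
            self.links = []
            self._href = None   # href of the currently open <a>, if any
            self._text = []     # text fragments seen inside it

        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                self._href = dict(attrs).get('href')  # entities already decoded here
                self._text = []

        def handle_data(self, data):
            if self._href is not None:
                self._text.append(data)

        def handle_endtag(self, tag):
            if tag == 'a' and self._href is not None:
                # normalize whitespace before comparing the link text
                if ' '.join(''.join(self._text).split()) == 'Info science':
                    self.links.append(self._href)
                self._href = None

    parser = InfoScienceLinks()
    parser.feed(html)
    print('\n'.join(parser.links))
    ```

    This prints https://url1&rl=h1 and https://url3&rl=h1, one per line.
    
    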
    
  2. Using xidel and XPath on the command line in a shell (no awk or regex needed):

    xidel -e '//a[contains(., "Info science")]/@href' -s <FILE or URL>
    

    Output:

    https://url1&rl=h1
    https://url3&rl=h1
    

    Works on Linux, macOS, Windows, BSD…

    Check https://github.com/benibela/xidel

  3. Use the xmllint tool with a specific XPath expression:

    xmllint --html --xpath "//a[contains(., 'Info science')]/@href" source.html
    

     href="https://url1&amp;rl=h1"
     href="https://url3&amp;rl=h1"
    

    To extract the raw URLs, pipe through a sed substitution:

    xmllint --html --xpath "//a[contains(., 'Info science')]/@href" source.html \
    | sed 's/^ *href="//; s/"$//'
    

    https://url1&amp;rl=h1
    https://url3&amp;rl=h1

    Note that xmllint keeps the ampersand escaped as &amp; when it serializes the attribute.
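    The sed cleanup can also decode the &amp; entity in the same pass. A sketch on the sample output, using printf as a stand-in for the xmllint call (POSIX sed only; \& in the replacement is a literal ampersand):

    ```shell
    # Stand-in for the xmllint output: one ` href="..."` line per match.
    printf ' href="https://url1&amp;rl=h1"\n href="https://url3&amp;rl=h1"\n' \
    | sed 's/^ *href="//; s/"$//; s/&amp;/\&/g'
    ```

    This prints https://url1&rl=h1 and https://url3&rl=h1, one per line.
    
    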
    