I’m trying to list all the URLs from a webpage, but only the ones in specific HTML <a> tags.
For example, the targeted URLs are the ones in <a> tags containing "Info science":
bunch of html before
<a rel="external nofollow" title="something" href="https://url1&rl=h1" target="_blank">
<strong class="doc" style="color:#000000">Info</strong> science</a>
<a rel="external nofollow" title="something" href="https://url2&rl=h1" target="_blank">
<strong class="image" style="color:#000000">Info</strong> bio</a>
<a rel="external nofollow" title="something" href="https://url3&rl=h1" target="_blank">
<strong class="doc" style="color:#000000">Info</strong> science</a>
<a rel="external nofollow" title="something" href="https://url4&rl=h1" target="_blank">
<strong class="image" style="color:#000000">Info</strong> bio</a>
bunch of html after
The end result should be:
list_links.txt
https://url1&rl=h1
https://url3&rl=h1
I tried, in a Linux terminal:
lynx -dump https://website_to_scrape | awk '/https:/{print $2}' > list_links.txt
but of course I get all the URLs on the web page, which is already a step forward.
I also tried
grep -r science source.html > list_links.txt
but I get a list of all the HTML lines containing the word "science", not just the URLs.
My knowledge is limited, but it would be great to have a solution as a bash script or something I could run from the terminal.
I’m open to other solutions, of course.
3 Answers
In Python…
Given the HTML data as a string, then:
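A minimal sketch using BeautifulSoup (the bs4 package is an assumption; the matching rule, link text containing "science", is reconstructed from the question):

from bs4 import BeautifulSoup

html_data = """
<a rel="external nofollow" title="something" href="https://url1&rl=h1" target="_blank">
<strong class="doc" style="color:#000000">Info</strong> science</a>
<a rel="external nofollow" title="something" href="https://url2&rl=h1" target="_blank">
<strong class="image" style="color:#000000">Info</strong> bio</a>
<a rel="external nofollow" title="something" href="https://url3&rl=h1" target="_blank">
<strong class="doc" style="color:#000000">Info</strong> science</a>
<a rel="external nofollow" title="something" href="https://url4&rl=h1" target="_blank">
<strong class="image" style="color:#000000">Info</strong> bio</a>
"""

soup = BeautifulSoup(html_data, "html.parser")

# keep only the <a> tags whose visible text mentions "science"
# (assumed filter, based on the sample HTML in the question)
links = [a["href"] for a in soup.find_all("a", href=True)
         if "science" in a.get_text()]

with open("list_links.txt", "w") as f:
    for link in links:
        print(link)           # show on stdout
        f.write(link + "\n")  # and save to the requested file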
Output:
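https://url1&rl=h1
https://url3&rl=h1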
Using xidel and XPath in the command line in a shell (don’t use awk nor regex):
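For example (the XPath predicate is an assumption based on the sample HTML; xidel can also fetch the page URL directly instead of reading a saved source.html):

# select the href attribute of every <a> whose text contains "science"
xidel -s source.html -e '//a[contains(., "science")]/@href'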
Required output:
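https://url1&rl=h1
https://url3&rl=h1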
Works on Linux, macOS, Windows, BSD…
Check https://github.com/benibela/xidel
Use the xmllint tool with a specific XPath pattern:
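Something like this should work (the XPath is an assumption mirroring the xidel answer; 2>/dev/null discards the parser warnings triggered by the bare & in the href values):

xmllint --html --xpath '//a[contains(., "science")]/@href' source.html 2>/dev/null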
This prints the matches as href="…" attribute pairs (the exact formatting varies between xmllint versions). To extract the raw URLs, pipe to a sed substitution:
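A sketch, assuming GNU sed (the \n in the replacement is a GNU extension):

# strip the surrounding href="…" wrapper, one URL per line
xmllint --html --xpath '//a[contains(., "science")]/@href' source.html 2>/dev/null \
  | sed 's/ *href="\([^"]*\)"/\1\n/g' > list_links.txt

list_links.txt should then contain:

https://url1&rl=h1
https://url3&rl=h1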