I have a Python script that imports a list of URLs from a CSV named list.csv, scrapes each one and outputs the anchor text and href of every link found on each URL from the CSV:
(For reference, the URLs in the CSV are all in column A.)
from requests_html import HTMLSession
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
import pandas
import csv
contents = []
with open('list.csv', 'r') as csvf:  # Open file in read mode
    urls = csv.reader(csvf)
    for url in urls:
        contents.append(url)  # Add each url to list contents

for url in contents:
    page = urlopen(url[0]).read()
    soup = BeautifulSoup(page, "lxml")
    for link in soup.find_all('a'):
        if len(link.text) > 0:
            print(url, link.text, '-', link.get('href'))
The output looks something like this, where https://www.example.com/csv-url-one/ and https://www.example.com/csv-url-two/ are the URLs in column A of the CSV:
['https://www.example.com/csv-url-one/'] Creative - https://www.example.com/creative/
['https://www.example.com/csv-url-one/'] Web Design - https://www.example.com/web-design/
['https://www.example.com/csv-url-two/'] PPC - https://www.example.com/ppc/
['https://www.example.com/csv-url-two/'] SEO - https://www.example.com/seo/
The issue is that I want the output to look more like this, i.e. not repeat the CSV URL before each result AND have a line break after the results for each URL from the CSV:
['https://www.example.com/csv-url-one/']
Creative - https://www.example.com/creative/
Web Design - https://www.example.com/web-design/
['https://www.example.com/csv-url-two/']
PPC - https://www.example.com/ppc/
SEO - https://www.example.com/seo/
Is this possible?
Thanks
3 Answers
It is possible.
Simply add \n at the end of the print. \n is the line break (newline) special character.
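For example, the print line inside the question's loop could become the following (just an illustration of this suggestion, reusing url, link and soup from the question's code; it adds a blank line after every printed result but does not stop the URL itself from repeating):

        if len(link.text) > 0:
            # the trailing "\n" prints an extra blank line after each result
            print(url, link.text, '-', link.get('href'), '\n')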
Does the following solve your problem?
To add a separation between URLs, add a \n before printing each URL. If you want to print a URL only if it has valid links, i.e. if len(link.text)>0:, use the for loop to save the valid links to a list, and only print the URL and its links if that list is not empty. Try this:
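Something along these lines should do it (a sketch that keeps the question's file name list.csv and the lxml parser; it collects the valid links for each URL into a list first, then prints the URL once, followed by its links and a blank line):

from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv

contents = []
with open('list.csv', 'r') as csvf:  # Open file in read mode
    urls = csv.reader(csvf)
    for url in urls:
        contents.append(url)  # Add each url to list contents

for url in contents:
    page = urlopen(url[0]).read()
    soup = BeautifulSoup(page, "lxml")

    # Collect the valid links for this url instead of printing them straight away
    links = []
    for link in soup.find_all('a'):
        if len(link.text) > 0:
            links.append(f"{link.text} - {link.get('href')}")

    # Only print the url (once) if it actually has valid links, then a blank line
    if links:
        print(url)
        for item in links:
            print(item)
        print()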