
I have a Python script that imports a list of URLs from a CSV named list.csv, scrapes them, and outputs any anchor text and href links found on each URL from the CSV:

(For reference, the URLs in the CSV are all in column A.)

from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv

contents = []
with open('list.csv','r') as csvf: # Open file in read mode
    urls = csv.reader(csvf)
    for url in urls:
        contents.append(url) # Add each url to list contents
    

for url in contents: 
    page = urlopen(url[0]).read()
    soup = BeautifulSoup(page, "lxml")

    for link in soup.find_all('a'):
        if len(link.text)>0:
            print(url, link.text, '-', link.get('href'))

The output looks something like this, where https://www.example.com/csv-url-one/ and https://www.example.com/csv-url-two/ are the URLs in column A of the CSV:

['https://www.example.com/csv-url-one/'] Creative - https://www.example.com/creative/
['https://www.example.com/csv-url-one/'] Web Design - https://www.example.com/web-design/
['https://www.example.com/csv-url-two/'] PPC - https://www.example.com/ppc/
['https://www.example.com/csv-url-two/'] SEO - https://www.example.com/seo/

The issue is I want the output to look more like this, i.e. not repeatedly print the CSV URL before each result, AND have a line break after each group of results for a URL:

['https://www.example.com/csv-url-one/'] 
Creative - https://www.example.com/creative/
Web Design - https://www.example.com/web-design/

['https://www.example.com/csv-url-two/'] 
PPC - https://www.example.com/ppc/
SEO - https://www.example.com/seo/

Is this possible?

Thanks

3 Answers


  1. It is possible.

    Simply add \n at the end of the print call.
    \n is the newline (line break) special character.

    for url in contents: 
        page = urlopen(url[0]).read()
        soup = BeautifulSoup(page, "lxml")
    
        for link in soup.find_all('a'):
            if len(link.text) > 0:
                print(url, '\n', link.text, '-', link.get('href'), '\n')
    
  2. Does the following solve your problem?

    for url in contents: 
        page = urlopen(url[0]).read()
        soup = BeautifulSoup(page, "lxml")
        print('\n', '********', ', '.join(url), '********', '\n')
        for link in soup.find_all('a'):
            if len(link.text) > 0:
                print(link.text, '-', link.get('href'))
    
  3. To add a separation between URLs, print a \n before each URL.

    If you want to print a URL only when it has valid links (i.e. links where len(link.text) > 0), use the for loop to save the valid links to a list, and only print the URL and its links if that list is not empty.

    Try this:

    for url in contents: 
        page = urlopen(url[0]).read()
        soup = BeautifulSoup(page, "lxml")
        
        valid_links = []
        for link in soup.find_all('a'):
            if len(link.text) > 0:
                valid_links.append(link)  # keep the whole tag so .text and .get() work below

        if len(valid_links):
            print('\n', url)
            for item in valid_links:
                print(item.text, '-', item.get('href'))