
I wrote a Python script to scrape https://lumiwallet.com/assets/ for all of its asset listings. I've managed to get the name of the first coin, "Bitcoin", but no ticker. The website has 27 pages with 40 assets per page. I'd like to scrape the names and tickers of all assets on all 27 pages, turn them into a pandas DataFrame, and then write a .csv file with one column for names and one for tickers. I think the solution is a for loop that iterates over the name tags (and also grabs the ticker, which my code currently doesn't return), but as a beginner I'm not sure where the loop should go, or whether that's even the right approach. My code and current output are below.


from urllib import response
from webbrowser import get
import requests
from bs4 import BeautifulSoup
import csv
from csv import writer
from csv import reader
from urllib.parse import urlparse
import pandas as pd
from urllib.parse import urlencode

API_KEY = '7bbcbb39-029f-4075-97bc-6b57b6e9e68b'

def get_scrapeops_url(url):
    payload = {'api_key': API_KEY, 'url': url}
    proxy_url = 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)
    return proxy_url

r = requests.get(get_scrapeops_url('https://lumiwallet.com/assets/'))
response = r.text


#list to store scraped data
data = []

soup = BeautifulSoup(response,'html.parser')
result = soup.find('div',class_ = 'assets-list__items')



# parse through the website's html
name = soup.find('div',class_ = 'asset-item__name')
ticker = soup.find('div',class__ = 'asset-item__short-name')

#Store data in a dictionary using key value pairs
d = {'name':name.text if name else None,'ticker':ticker.text if ticker else None} 

data.append(d)

#convert to a pandas df
data_df = pd.DataFrame(data)

data_df.to_csv("coins_scrape_lumi.csv", index=False)

print(data_df)


2 Answers


  1. Why are you using the ScrapeOps proxy when you can request the URL directly? You already get the HTML back.
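
    For reference, a direct request without the proxy would look like the following; whether it works depends on the site serving the page to plain requests, which I haven't verified:

    import requests

    # fetch the listings page directly, no proxy in between
    r = requests.get('https://lumiwallet.com/assets/')
    response = r.text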

    In your code, the ticker line has an extra underscore: class__ should be class_. With class__, BeautifulSoup filters on an attribute that doesn't exist and returns None, which is why you get the name but no ticker.

    ticker = soup.find('div', class_ = 'asset-item__short-name')
    

    To get all of the assets on the page, use find_all and loop over the results:

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    from urllib.parse import urlencode
    
    API_KEY = '7bbcbb39-029f-4075-97bc-6b57b6e9e68b'
    
    def get_scrapeops_url(url):
        payload = {'api_key': API_KEY, 'url': url}
        proxy_url = 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)
        return proxy_url
    
    print(get_scrapeops_url('https://lumiwallet.com/assets/'))
    r = requests.get(get_scrapeops_url('https://lumiwallet.com/assets/'))
    response = r.text
    
    
    #list to store scraped data
    data = []
    
    soup = BeautifulSoup(response,'html.parser')
    result = soup.find('div',class_ = 'assets-list__items')
    
    
    
    # parse through the website's html
    name = soup.find_all('div',class_ = 'asset-item__name')
    ticker = soup.find_all('div',class_ = 'asset-item__short-name')
    
    # Store each name/ticker pair as a dictionary
    for name_div, ticker_div in zip(name, ticker):
        d = {'name': name_div.text, 'ticker': ticker_div.text}
        data.append(d)
    print(data)
    
    #convert to a pandas df
    data_df = pd.DataFrame(data)
    
    data_df.to_csv("coins_scrape_lumi.csv", index=False)
    
    print(data_df)
    

    To get the other pages you have to simulate clicking the pagination buttons. BeautifulSoup can't do that, so you need a browser-automation library such as Selenium.

  2. Because every page of listings on this site lives at the same URL, we can't request each page directly, so I'm opting to use Selenium browser automation to handle the page clicking. The script below opens the page, scrapes the first page of listings, then loops through clicking and scraping 26 more times, so all 27 pages are scraped with 26 clicks to the next page. The text contents are collected into one list, and string splitting on the newline character "\n" separates each listing name from its ticker. Finally, those are turned into a pandas DataFrame and exported to CSV in "write" mode, so the file is overwritten whenever the script is executed.
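
    Just to illustrate the splitting step in isolation, assuming an asset element's text looks like "Bitcoin\nBTC":

    # split the element text on the newline to separate name from ticker
    name, ticker = "Bitcoin\nBTC".split("\n")
    print(name, ticker)  # prints: Bitcoin BTC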

    The only thing you should need to do before it works as intended is to make sure you have Selenium installed and updated, along with a version of chromedriver that matches your Google Chrome browser version. Then replace the text in this line, because my chromedriver path won't work for you: driver = webdriver.Chrome(r"YOUR CHROMEDRIVER PATH HERE", options=options), and point the webdriver to the location of chromedriver on your machine.
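
    One note that is not part of the original script: newer Selenium 4 releases no longer accept the driver path as a positional argument, so if that line errors for you, the path goes through a Service object instead. A sketch of that variant, reusing the same placeholder path:

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    options = webdriver.ChromeOptions()
    options.add_argument('--disable-blink-features=AutomationControlled')
    # Selenium 4 style: pass the chromedriver location via Service
    driver = webdriver.Chrome(service=Service(r"YOUR CHROMEDRIVER PATH HERE"), options=options)

    Recent Selenium versions can also locate a matching driver automatically, in which case webdriver.Chrome(options=options) alone may be enough.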

    import pandas as pd
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    
    url = 'https://lumiwallet.com/assets/'
    
    options = webdriver.ChromeOptions()
    options.add_argument('--disable-blink-features=AutomationControlled')
    
    driver = webdriver.Chrome(r"YOUR CHROMEDRIVER PATH HERE", options=options)
    driver.get(url)
    driver.implicitly_wait(3)
    
    def get_assets():
        assets = []
        driver.implicitly_wait(3)
        asset_list = driver.find_element(By.CLASS_NAME, "assets-list__items")
        page_assets = asset_list.find_elements(By.CLASS_NAME, "asset-item")
    
        for asset in page_assets:
            assets.append(asset.text)
        
        return assets
    
    def click_next_page():
        next_page = driver.find_element(By.CSS_SELECTOR, "#__layout > section > div.app-container > div > div > div.assets__body > div.assets-list > div.assets-list__pagination > div > div:nth-child(3) > div.pagination__item.pagination__item--next")
        driver.implicitly_wait(3)
        driver.execute_script("arguments[0].click();", next_page)
        driver.implicitly_wait(3)
    
    assets = get_assets()
    
    for i in range(26):
        click_next_page()
        assets += get_assets()
    
    listings = []
    tickers = []
    
    for asset in assets:
        new = asset.split("\n")
        listings.append(new[0])
        tickers.append(new[1])
        
    df = pd.DataFrame(list(zip(listings, tickers)),
                      columns = ['Listing', 'Ticker'])
    
    df.to_csv('lumiwallet_listings.csv', mode='w', index=False)
    

    Let me know if you have any questions about this. Cheers!
