my for loop is not iterating through a list of urls, only executing for the first item - SEO

user1366487
April 10, 2023
285 views
0 votes
2 Answers

I’m very much a beginner and, after banging my head against the wall, I’m asking for any help at all with this. I want to scrape a list of URLs, but my for loop only returns the first item in the list.

I have a list of URLs and a function that scrapes the JSON data into a dictionary; I then convert the dictionary to a DataFrame and export it to CSV. Everything works except the for loop, so only the first URL in the list gets scraped:

url_list_str = ['https://www.foodpanda.ph/restaurant/vh2d/sicilian-roast-legaspi-village',
 'https://www.foodpanda.ph/restaurant/ns76/tokyo-milk-cheese-factory-greenbelt-5',
 'https://www.foodpanda.ph/restaurant/hksd/paul-greenbelt-5']

for url in url_list_str:
  url = url_list_str[0]
  response = req.get(url, headers = headers)
  pause(5)
  html = BeautifulSoup(response.content, 'html.parser')
  data = foodpanda_data(html)
  restaurant_name = data['Name']
  df = pd.DataFrame([data])

foodpanda_data() is a function defined above the for loop that scrapes the JSON and turns it into a dictionary. Here’s a preview, because it’s pretty long:

def foodpanda_data(html):
  script_tag = html.find("script", {"data-testid": "restaurant-seo-schema"})
  json_text = script_tag.string
  json_dict = json.loads(json_text)
  
  extracted_data = {}
  keys_to_extract = ["Name", "streetAddress", "addressLocality", "postalCode", "latitude", "longitude", "url", "ratingValue", "ratingCount", "bestRating", "worstRating", "servesCuisine", "priceRange"]
  for key in keys_to_extract:
    if key.lower() == 'name':
      extracted_data[key] = json_dict.get('name', '') #... etc.

  return extracted_data

I also tried writing the for loop as:

for u in range(len(url_list_str)):
  url = url_list_str[u]

but that didn’t work either. There must be something really obvious here that I’m not getting so thank you!

Answers

This happens because on every iteration you overwrite the loop variable with the first URL in the list (url = url_list_str[0]). Simply remove that line.

url_list_str = ['https://www.foodpanda.ph/restaurant/vh2d/sicilian-roast-legaspi-village',
         'https://www.foodpanda.ph/restaurant/ns76/tokyo-milk-cheese-factory-greenbelt-5',
         'https://www.foodpanda.ph/restaurant/hksd/paul-greenbelt-5']

for url in url_list_str:
    response = req.get(url, headers = headers)
    pause(5)
    html = BeautifulSoup(response.content, 'html.parser')
    data = foodpanda_data(html)
    restaurant_name = data['Name']
    df = pd.DataFrame([data])
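
As a minimal sketch of why the original loop kept processing the same page, the snippet below reproduces the pattern without any network requests. The example.com URLs and the names `urls`, `seen_buggy`, and `seen_fixed` are hypothetical stand-ins, not part of the original code.

```python
# Hypothetical stand-in list; no network requests are made here.
urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]

# Buggy pattern: the loop variable is overwritten on every pass.
seen_buggy = []
for url in urls:
    url = urls[0]  # discards the value the loop just assigned
    seen_buggy.append(url)

# Fixed pattern: use the loop variable as-is and collect one result per URL.
seen_fixed = []
for url in urls:
    seen_fixed.append(url)

print(seen_buggy)  # the first URL, three times
print(seen_fixed)  # each URL once, in order
```

The loop itself runs three times in both cases; the buggy version only looks like it stops at the first item because every pass works on the same URL.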

I guess you’re trying to do something like this:

import json
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

def foodpanda_data(html):
    script_tag = html.find("script", {"data-testid": "restaurant-seo-schema"})
    json_text = script_tag.string
    json_dict = json.loads(json_text)
    extracted_data = {
        "name": json_dict['name'],
        "streetAddress": json_dict['address']['streetAddress'],
        "addressLocality": json_dict['address']['addressLocality'],
        "postalCode": json_dict['address']['postalCode'],
        "latitude": json_dict['geo']['latitude'],
        "longitude": json_dict['geo']['longitude'],
        "url": json_dict['url'],
        "ratingValue": json_dict['aggregateRating']['ratingValue'],
        "ratingCount": json_dict['aggregateRating']['ratingCount'],
        "bestRating": json_dict['aggregateRating']['bestRating'],
        "worstRating": json_dict['aggregateRating']['worstRating'],
        "servesCuisine": json_dict['servesCuisine'],
        "priceRange": json_dict['priceRange']
    }
    return extracted_data


url_list_str = ['https://www.foodpanda.ph/restaurant/vh2d/sicilian-roast-legaspi-village',
            'https://www.foodpanda.ph/restaurant/ns76/tokyo-milk-cheese-factory-greenbelt-5',
            'https://www.foodpanda.ph/restaurant/hksd/paul-greenbelt-5']

all_data = []
for url in url_list_str:
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36"
              }
    response = requests.get(url, headers=headers)
    html = BeautifulSoup(response.content, 'html.parser')
    data = foodpanda_data(html)
    all_data.append(data)
    time.sleep(1)

df = pd.DataFrame(all_data)
print(df.head())

output:

                                     name                                      streetAddress addressLocality postalCode   latitude   longitude                                                url  ratingValue  ratingCount  bestRating  worstRating                           servesCuisine priceRange
0         Sicilian Roast - Legaspi Village  100 Don Carlos Palanca corner Dela Rosa Street...     Makati City       1229  14.556083  121.019540  https://www.foodpanda.ph/restaurant/vh2d/sicil...          4.4           29           5            1                 [Italian, Pizza, Pasta]         ₱₱
1  Tokyo Milk Cheese Factory - Greenbelt 5  2nd Floor Greenbelt 5 Legazpi Street Legazpi V...     Makati City       1229  14.553329  121.022054  https://www.foodpanda.ph/restaurant/ns76/tokyo...          5.0           58           5            1    [Desserts, Fast Food, Snacks, Cakes]        ₱₱₱
2                       PAUL - Greenbelt 5  Ground Floor Greenbelt 5 Legazpi Street Barang...     Makati City       1223  14.552704  121.020531  https://www.foodpanda.ph/restaurant/hksd/paul-...          4.7           12           5            1  [Sandwiches, American, Western, Bread]         ₱₱
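
The question also mentions exporting to CSV, which the code above stops short of. With pandas, `df.to_csv("restaurants.csv", index=False)` should cover it (the filename is only an example). The sketch below shows the same idea using only the standard library `csv` module, which is a swapped-in alternative rather than what the original code uses. The sample rows and the helper name `rows_to_csv` are hypothetical.

```python
import csv
import io

# Hypothetical rows shaped like the dictionaries returned by foodpanda_data().
all_data = [
    {"name": "Restaurant A", "ratingValue": 4.4, "servesCuisine": ["Italian", "Pizza"]},
    {"name": "Restaurant B", "ratingValue": 5.0, "servesCuisine": ["Desserts"]},
]

def rows_to_csv(rows, out):
    """Write a list of dictionaries to a file-like object as CSV."""
    # Assumes every row has the same keys as the first one.
    fieldnames = list(rows[0].keys())
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    for row in rows:
        # Join list values such as servesCuisine into a single cell.
        flat = {k: ", ".join(v) if isinstance(v, list) else v for k, v in row.items()}
        writer.writerow(flat)

# An in-memory buffer keeps the example self-contained.
buffer = io.StringIO()
rows_to_csv(all_data, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write an actual file, pass `open("restaurants.csv", "w", newline="", encoding="utf-8")` instead of the in-memory buffer.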
