
I’m very much a beginner, and after banging my head against the wall I’m asking for any help at all with this. I want to scrape a list of URLs, but my for loop only ever processes the first item in the list.

I have a list of URLs and a function that scrapes the JSON data into a dictionary; I then convert the dictionary to a DataFrame and export it to CSV. Everything works except the for loop, so only the first URL in the list gets scraped:

url_list_str = ['https://www.foodpanda.ph/restaurant/vh2d/sicilian-roast-legaspi-village',
 'https://www.foodpanda.ph/restaurant/ns76/tokyo-milk-cheese-factory-greenbelt-5',
 'https://www.foodpanda.ph/restaurant/hksd/paul-greenbelt-5']

for url in url_list_str:
  url = url_list_str[0]
  response = req.get(url, headers = headers)
  pause(5)
  html = BeautifulSoup(response.content, 'html.parser')
  data = foodpanda_data(html)
  restaurant_name = data['Name']
  df = pd.DataFrame([data])

foodpanda_data() is a function defined above the for loop; it scrapes the JSON and turns it into a dictionary. Here’s a preview, because it’s pretty long:

def foodpanda_data(html):
  script_tag = html.find("script", {"data-testid": "restaurant-seo-schema"})
  json_text = script_tag.string
  json_dict = json.loads(json_text)
  
  extracted_data = {}
  keys_to_extract = ["Name", "streetAddress", "addressLocality", "postalCode", "latitude", "longitude", "url", "ratingValue", "ratingCount", "bestRating", "worstRating", "servesCuisine", "priceRange"]
  for key in keys_to_extract:
    if key.lower() == 'name':
      extracted_data[key] = json_dict.get('name', '') #... etc.

  return extracted_data

I also tried writing the for loop as:

for u in range(len(url_list_str)):
  url = url_list_str[u]

but that didn’t work either. There must be something really obvious here that I’m not getting so thank you!

2 Answers


  1. In every iteration you’re picking the first URL from the list (url = url_list_str[0]), which overwrites the loop variable. Simply remove that line.

    url_list_str = ['https://www.foodpanda.ph/restaurant/vh2d/sicilian-roast-legaspi-village',
                    'https://www.foodpanda.ph/restaurant/ns76/tokyo-milk-cheese-factory-greenbelt-5',
                    'https://www.foodpanda.ph/restaurant/hksd/paul-greenbelt-5']
    
    for url in url_list_str:
        response = req.get(url, headers = headers)
        pause(5)
        html = BeautifulSoup(response.content, 'html.parser')
        data = foodpanda_data(html)
        restaurant_name = data['Name']
        df = pd.DataFrame([data])
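
The effect of that one line can be reproduced without any network access. The sketch below is only an illustration: the example.com URLs and both function names are made up for the demonstration and are not part of the original code.

```python
# Hypothetical URLs, used only to show which value the loop body sees.
urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]


def visited_with_reassignment(url_list):
    """Mimic the original loop: the loop variable is overwritten each pass."""
    visited = []
    for url in url_list:
        url = url_list[0]  # the bug: always resets url to the first item
        visited.append(url)
    return visited


def visited_without_reassignment(url_list):
    """Same loop with the offending line removed."""
    visited = []
    for url in url_list:
        visited.append(url)
    return visited


print(visited_with_reassignment(urls))
print(visited_without_reassignment(urls))
```

The first call lists the first URL three times and the second lists each URL once. Note also that df = pd.DataFrame([data]) inside the loop replaces df on every pass, so to keep all rows you would probably want to collect each data dictionary in a list and build the DataFrame once after the loop.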
    
  2. I guess you’re trying to do something like this:

    import json
    import time
    import requests
    import pandas as pd
    from bs4 import BeautifulSoup
    pd.set_option('display.max_rows', 500)
    pd.set_option('display.max_columns', 500)
    pd.set_option('display.width', 1000)
    
    def foodpanda_data(html):
        script_tag = html.find("script", {"data-testid": "restaurant-seo-schema"})
        json_text = script_tag.string
        json_dict = json.loads(json_text)
        extracted_data = {
            "name": json_dict['name'],
            "streetAddress": json_dict['address']['streetAddress'],
            "addressLocality": json_dict['address']['addressLocality'],
            "postalCode": json_dict['address']['postalCode'],
            "latitude": json_dict['geo']['latitude'],
            "longitude": json_dict['geo']['longitude'],
            "url": json_dict['url'],
            "ratingValue": json_dict['aggregateRating']['ratingValue'],
            "ratingCount": json_dict['aggregateRating']['ratingCount'],
            "bestRating": json_dict['aggregateRating']['bestRating'],
            "worstRating": json_dict['aggregateRating']['worstRating'],
            "servesCuisine": json_dict['servesCuisine'],
            "priceRange": json_dict['priceRange']
        }
        return extracted_data
    
    
    url_list_str = ['https://www.foodpanda.ph/restaurant/vh2d/sicilian-roast-legaspi-village',
                    'https://www.foodpanda.ph/restaurant/ns76/tokyo-milk-cheese-factory-greenbelt-5',
                    'https://www.foodpanda.ph/restaurant/hksd/paul-greenbelt-5']
    
    # The request headers are the same for every URL, so define them once.
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36"
    }
    
    all_data = []
    for url in url_list_str:
        response = requests.get(url, headers=headers)
        html = BeautifulSoup(response.content, 'html.parser')
        data = foodpanda_data(html)
        all_data.append(data)  # collect one dictionary per restaurant
        time.sleep(1)  # short pause between requests
    
    df = pd.DataFrame(all_data)
    print(df.head())
    

    output:

                                         name                                      streetAddress addressLocality postalCode   latitude   longitude                                                url  ratingValue  ratingCount  bestRating  worstRating                           servesCuisine priceRange
    0         Sicilian Roast - Legaspi Village  100 Don Carlos Palanca corner Dela Rosa Street...     Makati City       1229  14.556083  121.019540  https://www.foodpanda.ph/restaurant/vh2d/sicil...          4.4           29           5            1                 [Italian, Pizza, Pasta]         ₱₱
    1  Tokyo Milk Cheese Factory - Greenbelt 5  2nd Floor Greenbelt 5 Legazpi Street Legazpi V...     Makati City       1229  14.553329  121.022054  https://www.foodpanda.ph/restaurant/ns76/tokyo...          5.0           58           5            1    [Desserts, Fast Food, Snacks, Cakes]        ₱₱₱
    2                       PAUL - Greenbelt 5  Ground Floor Greenbelt 5 Legazpi Street Barang...     Makati City       1223  14.552704  121.020531  https://www.foodpanda.ph/restaurant/hksd/paul-...          4.7           12           5            1  [Sandwiches, American, Western, Bread]         ₱₱
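
One caveat with the direct indexing used in foodpanda_data() above: if a page’s JSON lacks one of the nested fields, a lookup such as json_dict['aggregateRating']['ratingValue'] raises a KeyError and the loop stops. Whether any real restaurant page omits these fields is an assumption here, not something checked against the site. A minimal sketch of a more tolerant lookup follows; dig is a hypothetical helper and the sample JSON is made up for illustration.

```python
import json


def dig(mapping, *keys, default=None):
    """Follow keys through nested dicts; return default if any step is missing."""
    current = mapping
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Made-up sample shaped like the schema above, deliberately without aggregateRating.
sample = json.loads('{"name": "Example Cafe", "address": {"postalCode": "1229"}}')

row = {
    "name": dig(sample, "name"),
    "postalCode": dig(sample, "address", "postalCode"),
    "ratingValue": dig(sample, "aggregateRating", "ratingValue"),
}
print(row)  # ratingValue is None instead of raising KeyError
```

With this approach a missing field becomes None in the row, which pandas would show as a missing value, rather than aborting the whole run.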
    