I’m very much a beginner and, after banging my head against the wall, I’m asking for any help at all with this. I want to scrape a list of URLs, but my for loop only returns the first item in the list.
I have a list of URLs and a function that scrapes the JSON data into a dictionary; I then convert the dictionary to a DataFrame and export it to a CSV. Everything works except the for loop: only the first URL in the list gets scraped:
url_list_str = ['https://www.foodpanda.ph/restaurant/vh2d/sicilian-roast-legaspi-village',
                'https://www.foodpanda.ph/restaurant/ns76/tokyo-milk-cheese-factory-greenbelt-5',
                'https://www.foodpanda.ph/restaurant/hksd/paul-greenbelt-5']

for url in url_list_str:
    url = url_list_str[0]
    response = req.get(url, headers = headers)
    pause(5)
    html = BeautifulSoup(response.content, 'html.parser')
    data = foodpanda_data(html)
    restaurant_name = data['Name']
    df = pd.DataFrame([data])
foodpanda_data() is a function defined above the for loop that scrapes the JSON and turns it into a dictionary. Here’s a preview, because it’s pretty long:
def foodpanda_data(html):
    script_tag = html.find("script", {"data-testid": "restaurant-seo-schema"})
    json_text = script_tag.string
    json_dict = json.loads(json_text)
    extracted_data = {}
    keys_to_extract = ["Name", "streetAddress", "addressLocality", "postalCode", "latitude", "longitude", "url", "ratingValue", "ratingCount", "bestRating", "worstRating", "servesCuisine", "priceRange"]
    for key in keys_to_extract:
        if key.lower() == 'name':
            extracted_data[key] = json_dict.get('name', '') #... etc.
    return extracted_data
I also tried writing the for loop as:
for u in range(len(url_list_str)):
    url = url_list_str[u]
but that didn’t work either. There must be something really obvious here that I’m not getting, so thank you!
2 Answers
This happens because, on every iteration, you’re overwriting url with the first URL in the list (url = url_list_str[0]). Simply remove that line.
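To see the effect in isolation, here is a stripped-down sketch that uses placeholder strings instead of real URLs. The names `items`, `seen_buggy`, and `seen_fixed` are made up for illustration and do not appear in the question.

```python
items = ["first", "second", "third"]

# Buggy version: the loop variable is overwritten with element 0 on every pass.
seen_buggy = []
for item in items:
    item = items[0]
    seen_buggy.append(item)

# Fixed version: use the loop variable as it is.
seen_fixed = []
for item in items:
    seen_fixed.append(item)

print(seen_buggy)  # ['first', 'first', 'first']
print(seen_fixed)  # ['first', 'second', 'third']
```

The loop still runs three times in the buggy version; it just processes the same element each time.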
I guess you’re trying to do something like this:
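One possible shape for such a loop, as a minimal sketch rather than a drop-in replacement. The function `scrape_all` and its parameters `fetch`, `parse`, and `delay` are hypothetical names introduced here. The fetching and parsing steps are passed in as callables so the sketch runs without network access, and the demo at the bottom uses stand-in lambdas rather than real requests.

```python
import time


def scrape_all(urls, fetch, parse, delay=0.0):
    """Fetch and parse every URL in `urls`, returning one dict per URL."""
    rows = []
    for url in urls:  # `url` is already the current item; do not reassign it
        html = fetch(url)
        rows.append(parse(html))  # keep every result instead of overwriting
        time.sleep(delay)  # pause between requests
    return rows


if __name__ == "__main__":
    # Stand-ins (assumptions) so the sketch runs offline.
    demo_urls = ["https://example.com/a", "https://example.com/b"]
    rows = scrape_all(
        demo_urls,
        fetch=lambda url: "<html>" + url + "</html>",
        parse=lambda html: {"Name": html},
    )
    print(len(rows))  # 2
```

With the code from the question, `fetch` could be something like `lambda url: BeautifulSoup(req.get(url, headers=headers).content, 'html.parser')` and `parse` would be `foodpanda_data`. Note that `df = pd.DataFrame([data])` inside the original loop replaces the DataFrame on every pass, so only the last restaurant would survive; collecting the dictionaries in a list and calling `pd.DataFrame(rows)` once after the loop gives one row per URL.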