Html - Getting title from nested divs

ihatecoding
June 28, 2023
2 Answers

I am very new to web scraping and want to scrape this website.
This is what I tried, but I don't know how to continue.

```
import requests
from bs4 import BeautifulSoup
url = "https://www.xbox.com/en-US/browse/games"
response = requests.get(url).text
soup = BeautifulSoup(response, 'lxml')
game_titles = soup.find('li')
```

I want the title, genre, and image URL for each game. How do I do this?

Here is the html script.

Answers

- rabbibillclinton
- June 28, 2023 at 9:44 pm
- 0 votes
To get the info for each category, you need the class names. Looking at the site, here’s what I found.

Titles are span elements whose class attribute is
ProductCard-module__singleLineTitle___32jUF typography-module__xdsBody2___RNdGY.

Images are img elements whose class attribute is
ProductCard-module__boxArt___-2vQY img-fluid.

Genres don’t seem to be listed.

Start by fetching the page and parsing it with BeautifulSoup.
```
import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.xbox.com/en-US/browse/games")
soup = BeautifulSoup(res.text, 'html.parser')
```
Titles
```
title_class = "ProductCard-module__singleLineTitle___32jUF typography-module__xdsBody2___RNdGY"
titles = soup.find_all("span", {"class": title_class})

# Collect the visible text of every matching span
title_texts = [title.get_text() for title in titles]
```
Images
```
image_class = "ProductCard-module__boxArt___-2vQY img-fluid"
images = soup.find_all("img", {"class": image_class})

# Collect the src attribute of every matching img
image_srcs = [image["src"] for image in images]

print(title_texts, image_srcs)
```
The title_texts and image_srcs lists will give you what you’re looking for.
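
One caveat with the lookups above: BeautifulSoup treats a multi-word class string as an exact match against the attribute value, so the search stops matching if the site reorders or adds a class. A minimal sketch of a looser variant, assuming markup shaped like the inline sample (the `SAMPLE_HTML` snippet and the example.com image URLs are made up for illustration and are not taken from the real page), selects on a single class with a CSS selector and pairs each title with its image:

```python
from bs4 import BeautifulSoup

# Hypothetical markup that reuses the class names quoted above.
# The second <img> lists its classes in a different order on purpose.
SAMPLE_HTML = """
<ul>
  <li>
    <img class="ProductCard-module__boxArt___-2vQY img-fluid" src="https://example.com/a.png">
    <span class="ProductCard-module__singleLineTitle___32jUF typography-module__xdsBody2___RNdGY">Game A</span>
  </li>
  <li>
    <img class="img-fluid ProductCard-module__boxArt___-2vQY" src="https://example.com/b.png">
    <span class="ProductCard-module__singleLineTitle___32jUF typography-module__xdsBody2___RNdGY">Game B</span>
  </li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A CSS class selector matches any element that carries the class,
# regardless of what other classes it has or their order.
title_spans = soup.select("span.ProductCard-module__singleLineTitle___32jUF")
box_art = soup.select("img.ProductCard-module__boxArt___-2vQY")

titles = [span.get_text(strip=True) for span in title_spans]
srcs = [img.get("src") for img in box_art]  # .get returns None if src is missing

# Pair each title with its image URL, assuming both appear in the same order.
games = list(zip(titles, srcs))
print(games)
```

The generated suffixes in these class names (`___32jUF`, `___-2vQY`) look build-specific, so they may change whenever the site is redeployed.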

The games list is loaded via JavaScript, so BeautifulSoup doesn’t see it. To load the game titles and image URLs, you can try this example:

```
import re
import json
import requests

url = "https://www.xbox.com/en-US/browse/games"
api_url = 'https://emerald.xboxservices.com/xboxcomfd/browse?locale=en-US'
html_doc = requests.get(url).text

# The first page of results is embedded in the HTML as a JSON object
data = re.search(r'window\.__PRELOADED_STATE__ = (.*);', html_doc).group(1)
data = json.loads(data)

for p in data['core2']['products']['productSummaries'].values():
    print(p['title'], p['images']['poster']['url'])

# Note: the bearer token below is a JWT whose `exp` claim is 1687986100
# (June 28, 2023 UTC), so it has expired and must be replaced with a current one.
headers = {
    'x-ms-api-version': '1.1',
    'ms-cv': 'HDDOd0MJ0jZl5tfDfh/YpH.17',
    'x-s2s-authorization': 'Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIsIng1dCI6Ii1LSTNROW5OUjdiUm9meG1lWm9YcWJIWkdldyIsImtpZCI6Ii1LSTNROW5OUjdiUm9meG1lWm9YcWJIWkdldyJ9.eyJhdWQiOiJhcGk6Ly9iZjRhMzdhMC04ZDU5LTQ4YzYtYjliZi04MGNiZjI5NGRlZTkiLCJpc3MiOiJodHRwczovL3N0cy53aW5kb3dzLm5ldC83MmY5ODhiZi04NmYxLTQxYWYtOTFhYi0yZDdjZDAxMWRiNDcvIiwiaWF0IjoxNjg3ODk5NDAwLCJuYmYiOjE2ODc4OTk0MDAsImV4cCI6MTY4Nzk4NjEwMCwiYWlvIjoiRTJaZ1lEQjR1WG9keStxRkRGc0UxUS9kN24xM0hBQT0iLCJhcHBpZCI6ImJmNGEzN2EwLThkNTktNDhjNi1iOWJmLTgwY2JmMjk0ZGVlOSIsImFwcGlkYWNyIjoiMSIsImlkcCI6Imh0dHBzOi8vc3RzLndpbmRvd3MubmV0LzcyZjk4OGJmLTg2ZjEtNDFhZi05MWFiLTJkN2NkMDExZGI0Ny8iLCJvaWQiOiI1MzQ3Mjk5ZC05ODk3LTQ0NjAtYjM4Yy03YWJlODg4MWRkNGMiLCJyaCI6IjAuQVJvQXY0ajVjdkdHcjBHUnF5MTgwQkhiUjZBM1NyOVpqY1pJdWItQXlfS1UzdWthQUFBLiIsInN1YiI6IjUzNDcyOTlkLTk4OTctNDQ2MC1iMzhjLTdhYmU4ODgxZGQ0YyIsInRpZCI6IjcyZjk4OGJmLTg2ZjEtNDFhZi05MWFiLTJkN2NkMDExZGI0NyIsInV0aSI6ImwtM3l3anFMSTBhbl9HX3Qzd2k5QUEiLCJ2ZXIiOiIxLjAifQ.Z7HsdumcHNIhvDR_NwvkfSQcWMk_xpeg37ylhggGIfiglFGzm9Am00R340NVRXrBg-7shvu1Fl9UbHE4ryksxEbKPn9pPr8MuPi6x_KRwU9hciQMXof2GNs6vGRyB-yoOcox326vWS_P2izmfasf9t34qZjC97Lv1hGml8K52m5AFNrobXYzk772gUXEWs4JhDRXniWPqfV_FYbISgcnjv6J149HMjqlGjZecgU_JIVT89h8DT3XoJOWO44hjX-kL-D0z-Y-ZIQiMrSrebLzqGoKaGwwPufPjG2WUYe-Sts7jzAmAvqa5OxmxOSP8-NZJMwPii3rqd-Oj6FmciPNNg',
}

# The embedded state also carries the continuation token for the next page
data = data['core2']['channels']['BROWSE_']['data']

for _ in range(1, 4):  # <-- increase number of pages here
    payload = {
        "ChannelKeyToBeUsedInResponse": "BROWSE_",
        "EncodedCT": data['encodedCT'],
        "Filters": "e30=",
        "ReturnFilters": False
    }

    data = requests.post(api_url, headers=headers, json=payload).json()

    for p in data['productSummaries']:
        print(p['title'], p['images']['poster']['url'])

    # Each response includes the token needed to request the page after it
    data = data['channels']['BROWSE_']
```

Prints:

```
AEW: Fight Forever Elite Edition - Pre-Order https://store-images.s-microsoft.com/image/apps.8819.13929383349878771.c97eae7d-03ad-4b3e-8912-14049595d572.99b67f06-8c73-4a57-9194-d2fb35cc4ed5
Call of Duty®: Warzone™ https://store-images.s-microsoft.com/image/apps.29984.13739535057760905.9506aae3-1290-433f-9d84-f3d91000412d.450f203f-99bf-4637-9918-fa5e599caf55
Fortnite https://store-images.s-microsoft.com/image/apps.23288.70702278257994163.3d53b09f-6089-475c-ac1f-443a287576e5.52db1592-4a73-4d72-a798-0e4097018581
Grand Theft Auto V https://store-images.s-microsoft.com/image/apps.32034.68565266983380288.0f5ef871-88c0-45f7-b108-6aacbc041fcf.9b094362-c51d-49e5-9e92-80710c585fca
Crash Team Rumble™ - Standard Edition https://store-images.s-microsoft.com/image/apps.28361.14522625350761556.dd54dba8-4eb6-4c4e-a3b9-be91510b6250.f6634e57-ec49-4e18-bcfa-d569077329df
```

...and so on.
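
The first step of the example, pulling the JSON object out of the page source, can be tried without any network access. This is only a sketch under assumptions: `SAMPLE_PAGE` is a made-up stand-in for the real page, and `extract_preloaded_state` is a hypothetical helper name. It uses the same regex idea but returns `None` when the marker is missing instead of raising an `AttributeError`:

```python
import json
import re

# Hypothetical page source; the real page embeds a much larger object.
SAMPLE_PAGE = """
<html><body>
<script>window.__PRELOADED_STATE__ = {"core2": {"products": {"productSummaries": {"ID1": {"title": "Game A", "images": {"poster": {"url": "https://example.com/a.png"}}}}}}};</script>
</body></html>
"""


def extract_preloaded_state(html_doc):
    """Return the parsed window.__PRELOADED_STATE__ object, or None if absent."""
    # The dots are escaped so they match literally. The greedy group runs
    # up to the last ';' on the same line, since '.' does not match newlines.
    match = re.search(r"window\.__PRELOADED_STATE__ = (.*);", html_doc)
    if match is None:
        return None
    return json.loads(match.group(1))


state = extract_preloaded_state(SAMPLE_PAGE)
summaries = state["core2"]["products"]["productSummaries"]

# productSummaries is keyed by product ID, so iterate over the values.
games = [(p["title"], p["images"]["poster"]["url"]) for p in summaries.values()]
print(games)
```

If the site ever puts the assignment on several lines, or adds another statement after it on the same line, the pattern would need adjusting.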

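
The paging loop in the example works by feeding the `encodedCT` value from each response into the next request. A rough offline sketch of that pattern follows. `iter_pages`, `encode_filters`, `fake_fetch`, and `FAKE_PAGES` are hypothetical names, the fake responses only imitate the shape used above, and stopping when no token comes back is an assumption about the API rather than documented behavior. The one certain detail is that `"e30="` is the base64 encoding of `{}`; that other filters would be base64-encoded JSON as well is a guess.

```python
import base64
import json


def encode_filters(filters):
    """Base64-encode a filter dict as compact JSON; an empty dict gives 'e30='."""
    raw = json.dumps(filters, separators=(",", ":")).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def iter_pages(fetch_page, first_token, max_pages):
    """Yield one list of product summaries per page, threading the token along."""
    token = first_token
    for _ in range(max_pages):
        payload = {
            "ChannelKeyToBeUsedInResponse": "BROWSE_",
            "EncodedCT": token,
            "Filters": encode_filters({}),
            "ReturnFilters": False,
        }
        page = fetch_page(payload)
        yield page["productSummaries"]
        token = page["channels"]["BROWSE_"].get("encodedCT")
        if not token:  # assumed stop condition: no token, no further pages
            break


# Stand-in for requests.post(api_url, headers=headers, json=payload).json(),
# so the sketch runs without network access.
FAKE_PAGES = {
    "ct-1": {"productSummaries": [{"title": "Game A"}],
             "channels": {"BROWSE_": {"encodedCT": "ct-2"}}},
    "ct-2": {"productSummaries": [{"title": "Game B"}],
             "channels": {"BROWSE_": {}}},
}


def fake_fetch(payload):
    return FAKE_PAGES[payload["EncodedCT"]]


titles = [p["title"]
          for page in iter_pages(fake_fetch, "ct-1", max_pages=5)
          for p in page]
print(titles)
```

To use it against the real endpoint, `fetch_page` would wrap the `requests.post` call from the example, with current headers.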