I am very new to web scraping and want to scrape this website.
This is what I tried, but I don’t know how to continue.

import requests
from bs4 import BeautifulSoup
url = "https://www.xbox.com/en-US/browse/games"
response = requests.get(url).text
soup = BeautifulSoup(response, 'lxml')
game_titles = soup.find('li')

I want the title, image URL, and genre for each game. How do I do this?


2 Answers


  1. To extract each field, you need the class names of the elements that hold it. Inspecting the page, here’s what I found.

    Titles are spans with the class name of
    ProductCard-module__singleLineTitle___32jUF typography-module__xdsBody2___RNdGY.

    Images are imgs with the class name of
    ProductCard-module__boxArt___-2vQY img-fluid.

    Genres don’t seem to be listed.

    Start by fetching the page and parsing it with BeautifulSoup.

    import requests
    from bs4 import BeautifulSoup
    
    res = requests.get("https://www.xbox.com/en-US/browse/games")
    soup = BeautifulSoup(res.text, 'html.parser')
    

    Titles

    title_class = "ProductCard-module__singleLineTitle___32jUF typography-module__xdsBody2___RNdGY"
    titles = soup.find_all("span", {"class": title_class})
    
    title_texts = []
    for title in titles:
        title_texts.append(title.get_text())
    

    Images

    image_class = "ProductCard-module__boxArt___-2vQY img-fluid"
    images = soup.find_all("img", {"class": image_class})
    
    image_srcs = []
    for image in images:
        image_srcs.append(image["src"])
    
    print(title_texts, image_srcs)
    

    The title_texts and image_srcs lists will then hold the titles and image URLs you’re looking for.
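
    Two separate lists can drift out of step if a card has no image, and the hashed suffixes in the class names (such as ___32jUF) may change between site builds. As a hedged sketch, the snippet below matches on the stable class-name prefix and reads the title and image from the same card. The SAMPLE_HTML markup and the assumption that each card is an li element are hypothetical stand-ins for the rendered page, and extract_cards is an illustrative helper name, not part of any library.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the rendered page; the real structure may differ.
SAMPLE_HTML = """
<ul>
  <li>
    <img class="ProductCard-module__boxArt___-2vQY img-fluid" src="https://example.com/a.png">
    <span class="ProductCard-module__singleLineTitle___32jUF typography-module__xdsBody2___RNdGY">Game A</span>
  </li>
  <li>
    <span class="ProductCard-module__singleLineTitle___32jUF typography-module__xdsBody2___RNdGY">Game B</span>
  </li>
</ul>
"""


def extract_cards(html):
    """Return one dict per card, keeping each title next to its own image URL."""
    soup = BeautifulSoup(html, "html.parser")
    cards = []
    for item in soup.find_all("li"):  # assumption: one <li> per product card
        # [class*="..."] matches on a substring, so the hashed suffix is ignored.
        title = item.select_one('span[class*="ProductCard-module__singleLineTitle"]')
        if title is None:
            continue  # not a product card
        image = item.select_one('img[class*="ProductCard-module__boxArt"]')
        cards.append({
            "title": title.get_text(strip=True),
            "image_url": image.get("src") if image else None,
        })
    return cards


if __name__ == "__main__":
    for card in extract_cards(SAMPLE_HTML):
        print(card)
```

    This only helps once you have HTML that actually contains the cards, for example from a browser-rendered copy of the page.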

  2. The game list is loaded via JavaScript, so BeautifulSoup doesn’t see it in the HTML that requests returns. To load the game titles and image URLs, you can try this example:

    import re
    import json
    import requests
    
    url = "https://www.xbox.com/en-US/browse/games"
    api_url = 'https://emerald.xboxservices.com/xboxcomfd/browse?locale=en-US'
    html_doc = requests.get(url).text
    
    # The first page of data is embedded in the HTML as JSON assigned to window.__PRELOADED_STATE__
    data = re.search(r'window\.__PRELOADED_STATE__ = (.*);', html_doc).group(1)
    data = json.loads(data)
    
    for p in data['core2']['products']['productSummaries'].values():
        print(p['title'], p['images']['poster']['url'])
    
    headers = {'x-ms-api-version': '1.1', 'ms-cv': 'HDDOd0MJ0jZl5tfDfh/YpH.17', 'x-s2s-authorization': 'Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIsIng1dCI6Ii1LSTNROW5OUjdiUm9meG1lWm9YcWJIWkdldyIsImtpZCI6Ii1LSTNROW5OUjdiUm9meG1lWm9YcWJIWkdldyJ9.eyJhdWQiOiJhcGk6Ly9iZjRhMzdhMC04ZDU5LTQ4YzYtYjliZi04MGNiZjI5NGRlZTkiLCJpc3MiOiJodHRwczovL3N0cy53aW5kb3dzLm5ldC83MmY5ODhiZi04NmYxLTQxYWYtOTFhYi0yZDdjZDAxMWRiNDcvIiwiaWF0IjoxNjg3ODk5NDAwLCJuYmYiOjE2ODc4OTk0MDAsImV4cCI6MTY4Nzk4NjEwMCwiYWlvIjoiRTJaZ1lEQjR1WG9keStxRkRGc0UxUS9kN24xM0hBQT0iLCJhcHBpZCI6ImJmNGEzN2EwLThkNTktNDhjNi1iOWJmLTgwY2JmMjk0ZGVlOSIsImFwcGlkYWNyIjoiMSIsImlkcCI6Imh0dHBzOi8vc3RzLndpbmRvd3MubmV0LzcyZjk4OGJmLTg2ZjEtNDFhZi05MWFiLTJkN2NkMDExZGI0Ny8iLCJvaWQiOiI1MzQ3Mjk5ZC05ODk3LTQ0NjAtYjM4Yy03YWJlODg4MWRkNGMiLCJyaCI6IjAuQVJvQXY0ajVjdkdHcjBHUnF5MTgwQkhiUjZBM1NyOVpqY1pJdWItQXlfS1UzdWthQUFBLiIsInN1YiI6IjUzNDcyOTlkLTk4OTctNDQ2MC1iMzhjLTdhYmU4ODgxZGQ0YyIsInRpZCI6IjcyZjk4OGJmLTg2ZjEtNDFhZi05MWFiLTJkN2NkMDExZGI0NyIsInV0aSI6ImwtM3l3anFMSTBhbl9HX3Qzd2k5QUEiLCJ2ZXIiOiIxLjAifQ.Z7HsdumcHNIhvDR_NwvkfSQcWMk_xpeg37ylhggGIfiglFGzm9Am00R340NVRXrBg-7shvu1Fl9UbHE4ryksxEbKPn9pPr8MuPi6x_KRwU9hciQMXof2GNs6vGRyB-yoOcox326vWS_P2izmfasf9t34qZjC97Lv1hGml8K52m5AFNrobXYzk772gUXEWs4JhDRXniWPqfV_FYbISgcnjv6J149HMjqlGjZecgU_JIVT89h8DT3XoJOWO44hjX-kL-D0z-Y-ZIQiMrSrebLzqGoKaGwwPufPjG2WUYe-Sts7jzAmAvqa5OxmxOSP8-NZJMwPii3rqd-Oj6FmciPNNg'}
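    # The bearer token above is a short-lived JWT; replace it with a fresh one if the API rejects the request.
    # The first page's channel data holds the encodedCT value that the next request sends back.
    data = data['core2']['channels']['BROWSE_']['data']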
    
    for _ in range(1, 4):  # <-- increase number of pages here
        payload = {
            "ChannelKeyToBeUsedInResponse": "BROWSE_",
            "EncodedCT": data['encodedCT'],
            "Filters": "e30=",
            "ReturnFilters": False
        }
    
        data = requests.post(api_url, headers=headers, json=payload).json()
    
        for p in data['productSummaries']:
            print(p['title'], p['images']['poster']['url'])
    
        data = data['channels']['BROWSE_']
    

    Prints:

    AEW: Fight Forever Elite Edition - Pre-Order https://store-images.s-microsoft.com/image/apps.8819.13929383349878771.c97eae7d-03ad-4b3e-8912-14049595d572.99b67f06-8c73-4a57-9194-d2fb35cc4ed5
    Call of Duty®: Warzone™ https://store-images.s-microsoft.com/image/apps.29984.13739535057760905.9506aae3-1290-433f-9d84-f3d91000412d.450f203f-99bf-4637-9918-fa5e599caf55
    Fortnite https://store-images.s-microsoft.com/image/apps.23288.70702278257994163.3d53b09f-6089-475c-ac1f-443a287576e5.52db1592-4a73-4d72-a798-0e4097018581
    Grand Theft Auto V https://store-images.s-microsoft.com/image/apps.32034.68565266983380288.0f5ef871-88c0-45f7-b108-6aacbc041fcf.9b094362-c51d-49e5-9e92-80710c585fca
    Crash Team Rumble™ - Standard Edition https://store-images.s-microsoft.com/image/apps.28361.14522625350761556.dd54dba8-4eb6-4c4e-a3b9-be91510b6250.f6634e57-ec49-4e18-bcfa-d569077329df
    
    
    ...and so on.
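
    The greedy pattern (.*); in the example above relies on the state object sitting on a single line that ends with a semicolon. As a hedged alternative, not what the answer itself does, the sketch below finds the assignment with a regex and lets json.JSONDecoder.raw_decode read exactly one JSON value from that position. SAMPLE_HTML is a made-up fragment that only mirrors the keys used above, and extract_preloaded_state and list_products are illustrative helper names.

```python
import json
import re

# Made-up page fragment; only the key names are taken from the example above.
SAMPLE_HTML = """
<script>
window.__PRELOADED_STATE__ = {"core2": {"products": {"productSummaries": {
  "ID1": {"title": "Game A", "images": {"poster": {"url": "https://example.com/a.png"}}},
  "ID2": {"title": "Game B; Deluxe", "images": {}}
}}}};
window.other = 1;
</script>
"""

STATE_ASSIGNMENT = re.compile(r"window\.__PRELOADED_STATE__\s*=\s*")


def extract_preloaded_state(html):
    """Parse the JSON object assigned to window.__PRELOADED_STATE__."""
    match = STATE_ASSIGNMENT.search(html)
    if match is None:
        raise ValueError("preloaded state not found in page")
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (the trailing ';' and the rest of the script).
    state, _end = json.JSONDecoder().raw_decode(html, match.end())
    return state


def list_products(state):
    """Return (title, poster_url) pairs; poster_url is None when missing."""
    summaries = state["core2"]["products"]["productSummaries"]
    rows = []
    for product in summaries.values():
        poster = product.get("images", {}).get("poster", {})
        rows.append((product["title"], poster.get("url")))
    return rows


if __name__ == "__main__":
    for title, url in list_products(extract_preloaded_state(SAMPLE_HTML)):
        print(title, url)
```

    Using .get for the poster lookup means a product without a poster image yields None instead of raising a KeyError.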
    