I have the following code:
from bs4 import BeautifulSoup
import requests
URL = 'https://www.youtube.com/gaming/games'
response = requests.get(URL).text
soup = BeautifulSoup(response, 'html.parser')
elem = soup.find_all('a', class_ = 'yt-simple-endpoint focus-on-expand style-scope ytd-game-details-renderer')
print(elem)
I am trying to isolate all the individual games on https://www.youtube.com/gaming/games.
I would like to just get the game name and how many people are watching. My issue is that I just can’t find the right ('', class_ = '') combo.
I’ve tried the following arguments to soup.find_all:
('a', class_ = 'yt-simple-endpoint focus-on-expand style-scope ytd-game-details-renderer')
('game', class_ = 'style-scope ytd-game-card-renderer')
(class_ = 'style-scope ytd-grid-renderer')
(id = 'items')
And many different variations.
If I just use find_all('div') I get random data. I really think (id = 'items') is my solution, but aside from 'div' I get the same response every time, a pair of brackets []. I’ve also tried searching the individual div class objects I get in the results, but so far I’m getting the same [] results or random data that I don’t need.
If I use find instead of find_all (elem = soup.find(id='items')) I get None as a response.
I’m looking at the viewer count, with an id of 'live-viewers-count', and it still prints [].
What I’m looking at:
2 Answers
You can’t really do this with BeautifulSoup alone, because this page is loaded dynamically with JavaScript, and BeautifulSoup doesn’t run JavaScript.
If you right-click in the page and select "show page source", you’ll see that it is mostly just compiled JavaScript.
To scrape YouTube, I’d either use Selenium to run a headless web browser, or Js2Py if you need performance.
… or simply use the YouTube APIs: https://developers.google.com/youtube/v3/docs ^_^’
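To illustrate why every selector in the question comes back as [] or None, here is a minimal sketch. The HTML string is a made-up stand-in for what requests.get(URL).text returns (it is not YouTube’s real markup, and the class name game-card is hypothetical): the elements only exist as text inside a script, so the parser never sees them as tags.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the raw response: the game cards are not in the
# markup, there is only a script that would build them in a real browser.
raw_html = """
<html><body>
  <div id="content"></div>
  <script>
    document.getElementById("content").innerHTML =
      '<a class="game-card">Example Game</a>';
  </script>
</body></html>
"""

soup = BeautifulSoup(raw_html, "html.parser")

# The <a> only appears as text inside the script, so nothing matches.
cards = soup.find_all("a", class_="game-card")
first_card = soup.find("a", class_="game-card")

# The script tag itself is the only place the data lives.
scripts = soup.find_all("script")

print(cards)         # empty result list
print(first_card)    # None
print(len(scripts))  # number of <script> tags found
```

A browser (or Selenium driving one) would execute that script and create the link; BeautifulSoup only parses the text it is given.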
Update
Here’s how to traverse the game data JSON elements.
First, narrow down to game_data, which is a list of JSON elements.
Now iterate over the list. For each element, there’s a section of the data packet we’ll call details, which contains the game name and views.
Then use the paths I showed in my original answer to capture the name and view count for each game.
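As a rough sketch of those three steps: the dictionary below is an invented, heavily simplified stand-in for the parsed JSON, and every key name in it (contents, items, gameDetails, title, viewers) is hypothetical. The real structure is nested much deeper and uses different keys, so treat this only as the shape of the loop.

```python
# Hypothetical, simplified stand-in for the parsed page JSON.
# None of these key names are taken from YouTube's real data.
page_data = {
    "contents": {
        "items": [
            {"gameDetails": {"title": "Example Game A", "viewers": "12K watching"}},
            {"gameDetails": {"title": "Example Game B", "viewers": "3.4K watching"}},
            {"otherEntry": {}},  # an element that carries no game details
        ]
    }
}


def list_games(data):
    """Return (name, viewers) pairs for every element that has game details."""
    # Step 1: narrow down to the list of game elements.
    game_data = data["contents"]["items"]

    results = []
    # Step 2: iterate over the list and pick out the details section.
    for element in game_data:
        details = element.get("gameDetails")
        if details is None:
            continue  # skip elements that are not games
        # Step 3: follow the path to the name and the view count.
        results.append((details["title"], details["viewers"]))
    return results


games = list_games(page_data)
for name, viewers in games:
    print(f"{name}: {viewers}")
```

Using .get() for the details section means an element without game details is skipped instead of raising a KeyError.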
Output
Original answer
All of the data you need is stored as JSON in one of the <script> tags; it’s just a pain to follow the nested object down to the fields you need. You can see it’s all there if you just look at soup.body.
I had a few spare minutes just now, so this should get you started. It shows you how to get to the Game and Live Viewers count for the first game currently listed (‘Valorant’).
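For the first part, getting the JSON out of a <script> tag, a minimal sketch could look like this. It assumes the script assigns a JSON object to a variable; the variable name pageData, the MARKER string, the helper extract_page_data and all key names are made up for illustration, so inspect soup.body to find the real ones.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical, heavily simplified page source. The variable name and the
# key names are invented; the real page uses different ones.
raw_html = """
<html><body>
  <script>var unrelated = 1;</script>
  <script>var pageData = {"games": [{"title": "Example Game", "viewers": "1.2K watching"}]};</script>
</body></html>
"""

# Assumed text that precedes the JSON object inside the script.
MARKER = "var pageData = "


def extract_page_data(html):
    """Return the JSON object assigned after MARKER, or None if not found."""
    soup = BeautifulSoup(html, "html.parser")
    for script in soup.find_all("script"):
        text = script.string or ""
        start = text.find(MARKER)
        if start == -1:
            continue
        # raw_decode parses one JSON value and ignores what follows it,
        # such as the trailing ";" of the JavaScript statement.
        data, _ = json.JSONDecoder().raw_decode(text[start + len(MARKER):])
        return data
    return None


data = extract_page_data(raw_html)
print(data["games"][0]["title"])
print(data["games"][0]["viewers"])
```

Once data is a plain dict, getting to a field is ordinary indexing, and you can loop over the list instead of using [0].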
This is how you get to the game name (you can iterate instead of indexing [0] to get all the games):
Output
And this is the viewer count:
Output