I’m trying to scrape data from this website, which has a table of game credits for different categories.
There are 24 categories in total, which I want to turn into 24 columns. The example page “https://www.mobygames.com/developer/sheet/view/developerId,1/” has 5 of them (Production, Design, Programming/Engineering, Support, and Thanks).
It would be easy if the rows in different sections had different classes, but they all share the same tr class: “devCreditsHighlight”. Different pages have different sections, and the order changes from page to page. On top of that, the information I need starts in the row after each section heading, and the number of rows varies.
My question is: is there a way to scrape only the section of a table that follows a certain keyword? For example, start scraping at the row containing the text “Business” and stop at the next separator row (the empty one with colspan=5).
Below is my code:
import bs4 as bs
import urllib.request

gameurl = "https://www.mobygames.com/developer/sheet/view/developerId,1"
req = urllib.request.Request(gameurl, headers={'User-Agent': 'Mozilla/5.0'})
sauce = urllib.request.urlopen(req).read()
soup = bs.BeautifulSoup(sauce, 'lxml')
infopage = soup.find_all("div", {"class": "col-md-8 col-lg-8"})
core_list = []
for credits in infopage:
    niceHeaderTitle = credits.find_all("h1", {"class": "niceHeaderTitle"})
    name = niceHeaderTitle[0].text
    # Section headings that appear on this page, as plain strings
    Titles = credits.find_all("h3", {"class": "clean"})
    Titles = [title.get_text() for title in Titles]
    # Each branch below reads the first highlighted row of the whole table
    if 'Business' in Titles:
        businessinfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        business = businessinfo[0].get_text(strip=True)
    else:
        business = 'none'
    if 'Production' in Titles:
        productioninfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        production = productioninfo[0].get_text(strip=True)
    else:
        production = 'none'
    if 'Design' in Titles:
        designinfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        design = designinfo[0].get_text(strip=True)
    else:
        design = 'none'
    if 'Writers' in Titles:
        writersinfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        writers = writersinfo[0].get_text(strip=True)
    else:
        writers = 'none'
    if 'Programming/Engineering' in Titles:
        programinfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        program = programinfo[0].get_text(strip=True)
    else:
        program = 'none'
    if 'Video/Cinematics' in Titles:
        videoinfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        video = videoinfo[0].get_text(strip=True)
    else:
        video = 'none'
    if 'Audio' in Titles:
        Audioinfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        audio = Audioinfo[0].get_text(strip=True)
    else:
        audio = 'none'
    if 'Art/Graphics' in Titles:
        artinfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        art = artinfo[0].get_text(strip=True)
    else:
        art = 'none'
    if 'Support' in Titles:
        supportinfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        support = supportinfo[0].get_text(strip=True)
    else:
        support = 'none'
    if 'Thanks' in Titles:
        thanksinfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        thanks = thanksinfo[0].get_text(strip=True)
    else:
        thanks = 'none'
    games = [name, business, production, design, writers, video, audio, art, support, program, thanks]
    core_list.append(games)
print(core_list)
And here is the HTML of the webpage:
</div>
</div></div>
<div class="row">
<div class="col-md-8 col-lg-8" style="overflow: hidden;">
<h1 class="niceHeaderTitle"><div class="pull-right"><a href="https://www.mobygames.com/developer/sheet/contribute/developerId,1/" class="btn btn-xs btn-mobysuccess">Contribute</a> </div>Brian Reynolds (I)</h1><ul class="nav nav-tabs" style="margin-bottom: 15px;"><li class="active"><a href="https://www.mobygames.com/developer/sheet/view/developerId,1/">Main</a></li><li><a href="https://www.mobygames.com/developer/brian-reynolds-i/credits/developerId,1/">Credits</a></li><li><a href="https://www.mobygames.com/developer/sheet/bio/developerId,1/">Biography</a></li><li><a href="https://www.mobygames.com/developer/shots/developerId,1/">Portraits</a></li></ul><h2 class="m5">Game Credits</h2>
<table class="devCreditsTable">
<tr><td colspan=5><h3 class="clean">Production</h3></td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/catan">Catan</a> (2007)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Project Lead</span>)</td></tr>
<tr><td colspan=5> </td></tr>
<tr><td colspan=5><h3 class="clean">Design</h3></td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/rise-of-nations-extended-edition">Rise of Nations: Extended Edition</a> (2017)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Design Lead</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/dominations">DomiNations</a> (2015)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Creative Director</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/age-of-empires-iii-the-asian-dynasties">Age of Empires III: The Asian Dynasties</a> (2007)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Creative Lead</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/catan">Catan</a> (2007)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Artificial Intelligence</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/rise-of-nations-rise-of-legends">Rise of Nations: Rise of Legends</a> (2006)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Design Lead</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/rise-of-nations-gold-edition">Rise of Nations: Gold Edition</a> (2004)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Design Lead</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/rise-of-nations-thrones-patriots">Rise of Nations: Thrones & Patriots</a> (2004)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Design Lead</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/rise-of-nations">Rise of Nations</a> (2003)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Game Design</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/sid-meiers-alpha-centauri-planetary-pack">Sid Meier's Alpha Centauri: Planetary Pack</a> (2001)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Created By</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/civilization-ii-multiplayer-gold-edition">Civilization II: Multiplayer Gold Edition</a> (1999)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Game Design</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/sid-meiers-alien-crossfire">Sid Meier's Alien Crossfire</a> (1999)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Design</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/sid-meiers-alpha-centauri">Sid Meier's Alpha Centauri</a> (1999)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Lead Design</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/sid-meiers-gettysburg">Sid Meier's Gettysburg!</a> (1997)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Design</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/sid-meiers-civilization-ii">Sid Meier's Civilization II</a> (1996)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Game Design</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/sid-meiers-colonization">Sid Meier's Colonization</a> (1994)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Game Design By</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/return-of-the-phantom">Return of the Phantom</a> (1993)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">MADS (MicroProse Adventure Development System) Designed by</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/super-quest">Super Quest</a> (1983)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Original Game Idea ('Quest 1')</span>)</td></tr>
<tr><td colspan=5> </td></tr>
<tr><td colspan=5><h3 class="clean">Programming/Engineering</h3></td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/rise-of-nations-rise-of-legends">Rise of Nations: Rise of Legends</a> (2006)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Project Lead</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/sid-meiers-alpha-centauri-planetary-pack">Sid Meier's Alpha Centauri: Planetary Pack</a> (2001)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Additional Programming</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/civilization-ii-multiplayer-gold-edition">Civilization II: Multiplayer Gold Edition</a> (1999)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Programming</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/sid-meiers-alien-crossfire">Sid Meier's Alien Crossfire</a> (1999)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Additional Programming</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/sid-meiers-alpha-centauri">Sid Meier's Alpha Centauri</a> (1999)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Programming</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/sid-meiers-gettysburg">Sid Meier's Gettysburg!</a> (1997)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Programming</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/sid-meiers-civilization-ii">Sid Meier's Civilization II</a> (1996)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Programming</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/once-upon-a-forest">Once Upon a Forest</a> (1995)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">MADS Game Engine</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/dragonsphere">Dragonsphere</a> (1994)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Technical Director</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/sid-meiers-colonization">Sid Meier's Colonization</a> (1994)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Programming</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/return-of-the-phantom">Return of the Phantom</a> (1993)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Lead Programmer</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/f-15-strike-eagle-iii">F-15 Strike Eagle III</a> (1992)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Installation Programming</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/rex-nebular-and-the-cosmic-gender-bender">Rex Nebular and the Cosmic Gender Bender</a> (1992)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Programming</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/quest-1">Quest 1</a> (1981)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">By</span>)</td></tr>
<tr><td colspan=5> </td></tr>
<tr><td colspan=5><h3 class="clean">Support</h3></td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/ville">The Ville</a> (2012)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Additional Help</span>)</td></tr>
<tr><td colspan=5> </td></tr>
<tr><td colspan=5><h3 class="clean">Thanks</h3></td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/kingdoms-of-amalur-reckoning">Kingdoms of Amalur: Reckoning</a> (2012)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Additional Thanks</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/c-evo">C-evo</a> (1999)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Game Design</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/sid-meiers-civnet">Sid Meier's CivNet</a> (1995)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Special Thanks To</span>)</td></tr>
<tr><td colspan=5> </td></tr>
</table>
Right now my results are:
[[‘Contribute Brian Reynolds (I)’, ‘none’, ‘Catan(2007)(Project Lead)’, ‘Catan(2007)(Project Lead)’, ‘none’, ‘none’, ‘none’, ‘none’, ‘Catan(2007)(Project Lead)’, ‘Catan(2007)(Project Lead)’, ‘Catan(2007)(Project Lead)’]]
What I want is to get the content of each specific section of the table. For example, the first section, Production, is collected correctly as ‘Catan(2007)(Project Lead)’, but only because it happens to be the first item. Because all the other rows have the same class, I am not sure how to collect their content. So the result would be something like:
[[‘Contribute Brian Reynolds (I)’, ‘none’, ‘Catan(2007)(Project Lead)’, ‘Catan(2007)(Project Lead)’, ‘none’, ‘none’, ‘none’, ‘none’, ‘Rise of Nations: Extended Edition (2017) (Design Lead)DomiNations (2015)(Creative Director)Age of Empires III: The Asian Dynasties (2007) (Creative Lead)Catan (2007)(Artificial Intelligence)Rise of Nations: Rise of Legends (2006)(Design Lead)Rise of Nations: Gold Edition (2004)(Design Lead)Rise of Nations: Thrones & Patriots (2004)(Design Lead)Rise of Nations (2003)(Game Design)Sid Meier’s Alpha Centauri: Planetary Pack (2001) (Created By)Civilization II: Multiplayer Gold Edition (1999)(Game Design)Sid Meier’s Alien Crossfire (1999)(Design)Sid Meier’s Alpha Centauri (1999)(Lead Design)Sid Meier’s Gettysburg! (1997) (Design)Sid Meier’s Civilization II (1996)(Game Design)Sid Meier’s Colonization (1994) (Game Design By)Return of the Phantom (1993) (MADS (MicroProse Adventure Development System) Designed by)Super Quest (1983)(Original Game Idea (‘Quest 1’))’, ‘Rise of Nations: Rise of Legends (2006) (Project Lead)Sid Meier’s Alpha Centauri: Planetary Pack (2001) (Additional Programming)Civilization II: Multiplayer Gold Edition (1999) (Programming)Sid Meier’s Alien Crossfire (1999)(Additional Programming)Sid Meier’s Alpha Centauri (1999)(Programming)Sid Meier’s Gettysburg! (1997)(Programming)Sid Meier’s Civilization II (1996) (Programming)Once Upon a Forest (1995)(MADS Game Engine)Dragonsphere (1994)(Technical Director)Sid Meier’s Colonization (1994)(Programming)Return of the Phantom (1993)(Lead Programmer)F-15 Strike Eagle III (1992) (Installation Programming)Rex Nebular and the Cosmic Gender Bender (1992)(Programming)Quest 1 (1981) (By)’, ‘The Ville (2012) (Additional Help)’]]
2 Answers
I can see how this would be difficult: the format is hard to deal with because the section title isn’t the parent of the section’s rows, but I found a way around it.
EDIT: this now pulls all rows, not just the first.
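A minimal sketch of that workaround, assuming the table layout shown in the question: walk the rows in document order, remember the most recent h3 heading, and attach every devCreditsHighlight row to it. The helper name credits_by_section and the trimmed SAMPLE_HTML are illustrative and not part of the original answer, and the sketch uses the built-in html.parser backend instead of lxml so it needs no extra parser.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question, used instead of a live request.
SAMPLE_HTML = """
<table class="devCreditsTable">
<tr><td colspan=5><h3 class="clean">Production</h3></td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/catan">Catan</a> (2007)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Project Lead</span>)</td></tr>
<tr><td colspan=5> </td></tr>
<tr><td colspan=5><h3 class="clean">Design</h3></td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/rise-of-nations-extended-edition">Rise of Nations: Extended Edition</a> (2017)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Design Lead</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/dominations">DomiNations</a> (2015)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Creative Director</span>)</td></tr>
<tr><td colspan=5> </td></tr>
</table>
"""


def credits_by_section(html):
    """Return {section title: [credit, ...]} in page order (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="devCreditsTable")
    sections = {}
    current = None  # title of the section we are currently inside
    for row in table.find_all("tr"):
        heading = row.find("h3", class_="clean")
        if heading is not None:
            # A heading row starts a new section.
            current = heading.get_text(strip=True)
            sections[current] = []
        elif current is not None and "devCreditsHighlight" in row.get("class", []):
            # A credit row belongs to the most recent heading.
            title = row.find("td").get_text().strip()
            role = row.find("span", class_="devCreditsTitle").get_text(strip=True)
            sections[current].append(f"{title} ({role})")
        # Separator rows (no heading, no class) are skipped.
    return sections


result = credits_by_section(SAMPLE_HTML)
for section, credits in result.items():
    print(section, credits)
```

For the live page, the same function could be called on the bytes returned by urllib.request.urlopen(req).read() from the question. Sections missing from a page simply do not appear as keys, so result.get('Business', 'none') would reproduce the 'none' placeholder.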
To get the data into columns: because each person will have a different set of categories, you should probably extract it in long form like this, add all the other people, and then use some data frame manipulation (a pivot) to produce the 24 columns you’re looking for.
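The pivot step could look roughly like the sketch below, assuming pandas is available and the credits have already been collected as (developer, category, credit) records. The records are taken from the sample page in the question, and CATEGORIES lists only the ten headings named in the question's code, since the full set of 24 is not given. Joining multiple credits with "; " is an arbitrary choice for illustration.

```python
import pandas as pd

# Headings named in the question's code; the full list of 24 is not shown there.
CATEGORIES = [
    "Business", "Production", "Design", "Writers", "Programming/Engineering",
    "Video/Cinematics", "Audio", "Art/Graphics", "Support", "Thanks",
]

# Long-format records: one row per (developer, category, credit).
records = [
    ("Brian Reynolds (I)", "Production", "Catan (2007) (Project Lead)"),
    ("Brian Reynolds (I)", "Design", "Rise of Nations: Extended Edition (2017) (Design Lead)"),
    ("Brian Reynolds (I)", "Design", "DomiNations (2015) (Creative Director)"),
    ("Brian Reynolds (I)", "Support", "The Ville (2012) (Additional Help)"),
]

long_df = pd.DataFrame(records, columns=["developer", "category", "credit"])

wide_df = (
    long_df.groupby(["developer", "category"])["credit"]
    .agg("; ".join)               # merge all credits of one section into one cell
    .unstack("category")          # categories become columns
    .reindex(columns=CATEGORIES)  # fixed column order, including absent sections
    .fillna("none")               # placeholder used in the question
)

print(wide_df.loc["Brian Reynolds (I)", "Design"])
```

Records for further developers can be appended to the same list before building long_df; each developer then becomes one row of wide_df.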
You could iterate through the rows of the table and, when you find a category heading, get the index of that row in the rows array. The row at index + 1 will be the tr containing the name of the game. This example shows getting each category and its game in a loop.
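One way to sketch the index-based idea, assuming the same table layout as in the question. The function name first_credit_per_category and the trimmed SAMPLE_HTML are made up for illustration, and html.parser stands in for lxml. Like the description above, it only looks at the single row after each heading.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the last two sections of the table from the question.
SAMPLE_HTML = """
<table class="devCreditsTable">
<tr><td colspan=5><h3 class="clean">Support</h3></td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/ville">The Ville</a> (2012)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Additional Help</span>)</td></tr>
<tr><td colspan=5> </td></tr>
<tr><td colspan=5><h3 class="clean">Thanks</h3></td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/kingdoms-of-amalur-reckoning">Kingdoms of Amalur: Reckoning</a> (2012)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Additional Thanks</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/c-evo">C-evo</a> (1999)</td><td class="devCreditsDivider"> </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Game Design</span>)</td></tr>
<tr><td colspan=5> </td></tr>
</table>
"""


def first_credit_per_category(html):
    """Return [(category, first game title), ...] (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find("table", class_="devCreditsTable").find_all("tr")
    pairs = []
    for index, row in enumerate(rows):
        heading = row.find("h3", class_="clean")
        if heading is None or index + 1 >= len(rows):
            continue  # not a heading row, or nothing follows it
        # The row right after the heading holds the first game of the section.
        game_link = rows[index + 1].find("a")
        if game_link is not None:
            pairs.append((heading.get_text(strip=True), game_link.get_text(strip=True)))
    return pairs


pairs = first_credit_per_category(SAMPLE_HTML)
print(pairs)
```

To collect every game in a section rather than the first, the inner lookup could keep advancing from index + 1 until it reaches a row without the devCreditsHighlight class.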
It seems the reason Production came back as none is that productionTitle is actually an array here, so you would have to loop through it to check whether each element contains ‘Production’.
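A small sketch of that check, assuming the headings come from find_all as in the question. find_all returns a list of tags, so the keyword has to be tested against the text of each element rather than against the list itself. The helper name has_section is hypothetical.

```python
from bs4 import BeautifulSoup

# Two heading rows copied from the table in the question.
SAMPLE_HTML = """
<table class="devCreditsTable">
<tr><td colspan=5><h3 class="clean">Production</h3></td></tr>
<tr><td colspan=5><h3 class="clean">Design</h3></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
headings = soup.find_all("h3", class_="clean")  # a list of Tag objects, not strings


def has_section(headings, keyword):
    """True if any heading's text contains the keyword (hypothetical helper)."""
    return any(keyword in heading.get_text() for heading in headings)


has_production = has_section(headings, "Production")
has_business = has_section(headings, "Business")
print(has_production, has_business)
```

Converting the headings to strings first, as the question's code does with get_text(), and then using an exact-match membership test is an equivalent alternative when the heading text is known exactly.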