My previous question (logging in to website using requests) generated some awesome answers, and with them I was able to scrape a lot of sites. But the site I’m working on now is tricky. I don’t know if it’s a website bug or something done intentionally, but I cannot scrape it.
Here’s part of my code:
import requests
import re
from lxml import html
from multiprocessing.dummy import Pool as ThreadPool
from fake_useragent import UserAgent
import time
import ctypes
global FileName
now = time.strftime('%d.%m.%Y_%H%M%S_')
FileName=str(now + "Scraped data.txt")
fileW = open(FileName, "w")
url = open('URL.txt', 'r').read().splitlines()
fileW.write("URL Name SKU Dimensions Availability MSRP NetPrice")
fileW.write(chr(10))
count=0
no_of_pools=14
r = requests.session()
payload = {
"email":"I cant give them out in public",
"password":"maybe I can share it privately if anyone can help me with it :)",
"redirect":"true"
}
rs = r.get("https://checkout.reginaandrew.com/store/checkout.ssp?fragment=login&is=login&lang=en_US&login=T#login-register")
rs = r.post("https://checkout.reginaandrew.com/store/checkout.ssp?fragment=login&is=login&lang=en_US&login=T#login-register",data=payload,headers={'Referer':"https://checkout.reginaandrew.com/store/my_account.ssp"})
rs = r.get("https://checkout.reginaandrew.com/store/my_account.ssp")
tree = html.fromstring(rs.content)
print(str(tree.xpath("//*[@id='site-header']/div[3]/nav/div[2]/div/div/a/@href")))
The problem is that even when I log in manually and open a product URL by entering it in the address bar, the browser doesn’t recognize that it’s logged in.
The only way around that is to click a link on the page you are redirected to after logging in. Only then does the browser recognize that it has logged in, and I can open specific URLs and see all the information.
The obstacle I ran into is that the link changes. The print statement in the code:
print(str(tree.xpath("//*[@id='site-header']/div[3]/nav/div[2]/div/div/a/@href")))
This should’ve extracted the link, but it returns nothing.
Any ideas?
EDIT (stripping out white space) rs.content is:
<!DOCTYPE html><html lang="en-US"><head><meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<link rel="shortcut icon" href="https://checkout.reginaandrew.com/c.1283670/store/img/favicon.ico" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">
<title></title>
<!--[if !IE]><!-->
<link rel="stylesheet" href="https://checkout.reginaandrew.com/c.1283670/store/css/checkout.css?t=1484321730904">
<!--<![endif]-->
<!--[if lte IE 9]>
<link rel="stylesheet" href="https://checkout.reginaandrew.com/c.1283670/store/css_ie/checkout_2.css?t=1484321730904">
<link rel="stylesheet" href="https://checkout.reginaandrew.com/c.1283670/store/css_ie/checkout_1.css?t=1484321730904">
<link rel="stylesheet" href="https://checkout.reginaandrew.com/c.1283670/store/css_ie/checkout.css?t=1484321730904">
<![endif]-->
<!--[if lt IE 9]>
<script src="/c.1283670/store/javascript/html5shiv.min.js"></script>
<script src="/c.1283670/store/javascript/respond.min.js"></script>
<![endif]-->
<script>var SC=window.SC={ENVIRONMENT:{jsEnvironment:typeof nsglobal==='undefined'?'browser':'server'},isCrossOrigin:function(){return 'checkout.reginaandrew.com'!==document.location.hostname},isPageGenerator:function(){return typeof nsglobal!=='undefined'},getSessionInfo:function(key){var session=SC.SESSION||SC.DEFAULT_SESSION||{};return key?session[key]:session},getPublishedObject:function(key){return SC.ENVIRONMENT&&SC.ENVIRONMENT.published&&SC.ENVIRONMENT.published[key]?SC.ENVIRONMENT.published[key]:null}};function loadScript(data){'use strict';var element;if(data.url){element='<script src="'+data.url+'"></'+'script>'}else{element='<script>'+data.code+'</'+'script>'}if(data.seo_remove){document.write(element)}else{document.write('</div>'+element+'<div class="seo-remove">')}}
</script>
</head>
<body>
<noscript>
<div class="checkout-layout-no-javascript-msg">
<strong>Javascript is disabled on your browser.</strong><br>
To view this site, you must enable JavaScript or upgrade to a JavaScript-capable browser.
</div>
</noscript>
<div id="main" class="main"></div>
<script>loadScript({url: '/c.1283670/store/checkout.environment.ssp?lang=en_US&cur=USD&t=' + (new Date().getTime())});
</script>
<script>if (!~window.location.hash.indexOf('login-register') && !~window.location.hash.indexOf('forgot-password') && 'login-register'){window.location.hash = 'login-register';}
</script>
<script src="/c.1283670/store/javascript/checkout.js?t=1484321730904"> </script>
<script src="/cms/2/assets/js/postframe.js"></script>
<script src="/cms/2/cms.js"></script>
<script>SCM['SC.Checkout'].Configuration.currentTouchpoint = 'login';</script>
</body>
</html>
2 Answers
This is going to be quite tricky and you might want to use a more sophisticated tool like Selenium to be able to emulate a browser.
Otherwise, you will need to investigate which cookies or other types of authentication are required for you to log in to the site. Note all the cookies that are being passed behind the scenes: it’s not quite as simple as just entering the username/password to log in here. You can see what information is being passed by viewing the Network tab in your web browser’s developer tools.
Finally, if you are worried that Selenium might be ‘sluggish’ (it is; after all, it is doing the same thing a user would be doing when opening a browser and clicking things), then you can try something like CasperJS, though its learning curve is considerably steeper than Selenium’s, so you might want to try Selenium first.
Scraping sites can be hard.
Some sites send you well-formed HTML and all you need to do is search within it to find data / links, whatever you need for scraping.
Some sites send you poorly-formed HTML. Browsers, over the years, have become pretty accepting of “bad” HTML and do the best they can to interpret what the HTML is trying to do. The downside is that if you’re using a strict parser to decipher the HTML, it may fail: you need something able to work with fuzzy data. Or, just brute force with regex. Your use of xpath only works if the resulting HTML creates a well-formed XML document.
Some sites (more and more these days) send a bit of HTML, and JavaScript, and perhaps JSON, XML, whatever to the browser. The browser then constructs the final HTML (the DOM) and displays it to the user. That’s what you have here.
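As a rough illustration of why the XPath in the question comes back empty, the sketch below inspects a trimmed stand-in for the rs.content shown above: an empty `<div id="main">` plus a script that would fill it in a real browser. It uses the standard library's html.parser instead of lxml so that it runs without extra packages; the names `SERVED_HTML`, `ShellInspector` and `inspect_served_html` are made up for this example, and the markup is a shortened assumption, not the full page.

```python
from html.parser import HTMLParser

# Trimmed, hypothetical stand-in for the markup the server returns:
# an empty container plus a script that would populate it in a browser.
SERVED_HTML = """
<html><body>
<div id="main" class="main"></div>
<script src="/store/javascript/checkout.js"></script>
</body></html>
"""


class ShellInspector(HTMLParser):
    """Collects link targets and counts scripts in the markup as served."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.script_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "script":
            self.script_count += 1


def inspect_served_html(markup):
    """Return (links, script_count) for the raw, unrendered markup."""
    inspector = ShellInspector()
    inspector.feed(markup)
    inspector.close()
    return inspector.links, inspector.script_count


links, script_count = inspect_served_html(SERVED_HTML)
print(links, script_count)
```

No links and at least one script is a strong hint that the navigation is built client-side, so no selector run against the raw response can find it.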
You want to scrape the final DOM, but that’s not what the site is sending you. So you have two options. One is to scrape what they actually send: for example, you might work out that the link you want can be derived from the JSON they send:
{books: [{title: "Graphs of Wrath", code: "a88kyyedkgH"}]}
==> example.com/catalog?id=a88kyyedkgH
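The JSON route might look something like the sketch below. It assumes the payload is strict JSON (the example above, with its keys quoted, since `json.loads` rejects unquoted keys) and that the catalog URL takes the code as an `id` query parameter. `PAYLOAD`, `CATALOG_URL` and `catalog_links` are hypothetical names, and the `https://` scheme is an assumption.

```python
import json
from urllib.parse import urlencode

# Hypothetical payload: the example above rewritten as strict JSON.
PAYLOAD = '{"books": [{"title": "Graphs of Wrath", "code": "a88kyyedkgH"}]}'

# Placeholder base URL from the example, with an assumed scheme.
CATALOG_URL = "https://example.com/catalog"


def catalog_links(payload, base_url=CATALOG_URL):
    """Map each book title to the catalog URL built from its code."""
    data = json.loads(payload)
    return {
        book["title"]: f"{base_url}?{urlencode({'id': book['code']})}"
        for book in data["books"]
    }


links = catalog_links(PAYLOAD)
print(links["Graphs of Wrath"])
```

On a real site, the payload would come from whichever request in the browser's Network tab carries the data, fetched with the same logged-in session.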
Or you scrape through a browser (e.g. using Selenium), letting the browser make all the requests and build up the DOM, and then you scrape the result. It’s slower, but it works.
When it gets hard, consider: