I am working with the R programming language and trying to learn how to use Selenium to interact with webpages.
For example, using Google Maps – I am trying to find the name, address and longitude/latitude of all Pizza shops around a certain area. As I understand, this would involve entering the location you are interested in, clicking the "nearby" button, entering what you are looking for (e.g. "pizza"), scrolling all the way to the bottom to make sure all pizza shops are loaded – and then copying the names, address and longitude/latitudes of all pizza locations.
I have been teaching myself how to use Selenium in R and have been able to solve parts of this problem. Here is what I have done so far:
Part 1: Searching for an address (e.g. Statue of Liberty, New York, USA) and returning its longitude/latitude:
library(RSelenium)
library(wdman)
library(netstat)
selenium()
selenium_object <- selenium(retcommand = T, check = F)
remote_driver <- rsDriver(browser = "chrome", chromever = "114.0.5735.90", verbose = F, port = free_port())
remDr <- remote_driver$client
remDr$navigate("https://www.google.com/maps")
search_box <- remDr$findElement(using = 'css selector', "#searchboxinput")
search_box$sendKeysToElement(list("Statue of Liberty", key = "enter"))
Sys.sleep(5)
url <- remDr$getCurrentUrl()[[1]]
long_lat <- gsub(".*@(-?[0-9.]+),(-?[0-9.]+),.*", "\\1,\\2", url)
long_lat <- unlist(strsplit(long_lat, ","))
> long_lat
[1] "40.7269409" "-74.0906116"
Part 2: Searching for all Pizza shops around a certain location:
library(RSelenium)
library(wdman)
library(netstat)
selenium()
selenium_object <- selenium(retcommand = T, check = F)
remote_driver <- rsDriver(browser = "chrome", chromever = "114.0.5735.90", verbose = F, port = free_port())
remDr <- remote_driver$client
remDr$navigate("https://www.google.com/maps")
Sys.sleep(5)
search_box <- remDr$findElement(using = 'css selector', "#searchboxinput")
search_box$sendKeysToElement(list("40.7256456,-74.0909442", key = "enter"))
Sys.sleep(5)
search_box <- remDr$findElement(using = 'css selector', "#searchboxinput")
search_box$clearElement()
search_box$sendKeysToElement(list("pizza", key = "enter"))
Sys.sleep(5)
But from here, I do not know how to proceed. I do not know how to scroll the page all the way to the bottom so that all available results are loaded, and I do not know how to start extracting the names.
Doing some research (i.e. inspecting the HTML code), I made the following observations:
- The name of a restaurant location can be found in the following tag: <a class="hfpxzc" aria-label=
- The address of a restaurant location can be found in the following tag: <div class="W4Efsd">
In the end, I would be looking for a result like this:
        name                             address longitude latitude
1 pizza land  123 fake st, city, state, zip code    45.212  -75.123
Can someone please show me how to proceed?
Note: Seeing as more people likely use Selenium through Python, I am more than happy to learn how to solve this problem in Python and then try to convert the answer into R code.
Thanks!
References:
- https://medium.com/python-point/python-crawling-restaurant-data-ab395d121247
- https://www.youtube.com/watch?v=GnpJujF9dBw
- https://www.youtube.com/watch?v=U1BrIPmhx10
UPDATE: Some further progress with addresses
remDr$navigate("https://www.google.com/maps")
Sys.sleep(5)
search_box <- remDr$findElement(using = 'css selector', "#searchboxinput")
search_box$sendKeysToElement(list("40.7256456,-74.0909442", key = "enter"))
Sys.sleep(5)
search_box <- remDr$findElement(using = 'css selector', "#searchboxinput")
search_box$clearElement()
search_box$sendKeysToElement(list("pizza", key = "enter"))
Sys.sleep(5)
address_elements <- remDr$findElements(using = 'css selector', '.W4Efsd')
addresses <- lapply(address_elements, function(x) x$getElementText()[[1]])
# names were still missing; per the observation above they live in the
# aria-label attribute of the a.hfpxzc links (counts may need aligning)
name_elements <- remDr$findElements(using = 'css selector', 'a.hfpxzc')
names <- lapply(name_elements, function(x) x$getElementAttribute('aria-label')[[1]])
result <- data.frame(name = unlist(names), address = unlist(addresses))
2 Answers
That is already a good start. I can name a few things I did to proceed, but note that I mainly worked with Python.
For locating elements within the DOM tree, I suggest using XPath. It has a human-readable syntax and is quite easy to learn.
https://devhints.io/xpath
There you can find an overview of all the ways to locate elements, along with a linked testbench by Whitebeam.org to practice on. It also helps with understanding how to extract names.
It will look something like this: the locator call returns an object for the given XPath expression, and on that object you reference the desired attribute, probably .text() in Python. I am not sure about the syntax in R.
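An untested RSelenium sketch, using the a.hfpxzc class the question already identified:

# find all result links by XPath, then read each shop name from aria-label
name_elements <- remDr$findElements(using = 'xpath', "//a[@class='hfpxzc']")
names <- sapply(name_elements, function(x) x$getElementAttribute('aria-label')[[1]])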
To scroll, there is the wheel actions API (https://www.selenium.dev/documentation/webdriver/actions_api/wheel/), but it has no documentation for R. Or you could use JavaScript for scrolling: https://cran.r-project.org/web/packages/js/vignettes/intro.html
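From RSelenium, JavaScript can be injected directly; a minimal sketch, assuming remDr is the client from the question:

# scroll down by one viewport height using injected JavaScript; on Google
# Maps the scrollable container is the results panel, so you may need to
# target that element rather than the window
remDr$executeScript("window.scrollBy(0, window.innerHeight);")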
Helpful resources:
https://statsandr.com/blog/web-scraping-in-r/
https://betterdatascience.com/r-web-scraping/
https://scrapfly.io/blog/web-scraping-with-r/#http-clients-crul
I see that you updated your question to say that a Python answer works too, so here's how it's done in Python. You can use the same method in R.
The page is lazy-loaded, which means that as you scroll, the data is paginated and loaded in.
So what you need to do is keep scrolling to the last loaded HTML element of the data, which triggers loading more content.
Finding out how more data is loaded
You need to find out how the data is loaded. Here’s what I did:
First, disable internet access for your browser in the Network tab of the developer tools (F12 -> Network -> Offline)
Then, scroll to the last loaded element; you will see a loading indicator (since there is no internet, it will just hang)
Now, here comes the important part: find out what HTML tag this loading indicator sits under. In this case, it is the element matched by the div.qjESne CSS selector.
Working with Selenium
You can call the JavaScript scrollIntoView() function, which will scroll a particular element into view within the browser's viewport.
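In R, the same call might look like this (a sketch, where last_result stands for a previously located web element such as the last loaded result):

# scroll a previously located element into view via JavaScript
remDr$executeScript("arguments[0].scrollIntoView();", list(last_result))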
Finding out when to break
To find out when to stop scrolling in order to load more data, we need to find out what element appears when there's no more data.
If you scroll until there are no more results, you will see an end-of-results message, which is an element under the span.HlvSq CSS selector.
Code example
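A minimal sketch of this method in RSelenium, assuming the div.qjESne and span.HlvSq selectors above, plus the a.hfpxzc and div.W4Efsd selectors from the question, are still current (Google changes these class names regularly):

# keep scrolling the last loaded element into view until the
# end-of-results marker appears
repeat {
  loaded <- remDr$findElements(using = 'css selector', 'div.qjESne')
  if (length(loaded) > 0) {
    remDr$executeScript("arguments[0].scrollIntoView();",
                        list(loaded[[length(loaded)]]))
  }
  Sys.sleep(2)
  # stop once the end-of-results element (span.HlvSq) is present
  end_marker <- remDr$findElements(using = 'css selector', 'span.HlvSq')
  if (length(end_marker) > 0) break
}

# with everything loaded, collect names (aria-label) and addresses
name_elements <- remDr$findElements(using = 'css selector', 'a.hfpxzc')
names <- sapply(name_elements, function(x) x$getElementAttribute('aria-label')[[1]])
address_elements <- remDr$findElements(using = 'css selector', 'div.W4Efsd')
addresses <- sapply(address_elements, function(x) x$getElementText()[[1]])
# note: div.W4Efsd matches more nodes than there are results, so the two
# vectors need filtering/alignment before they fit in one data frame
result <- data.frame(name = names, address = addresses[seq_along(names)])

The Sys.sleep() calls are a crude fixed wait; polling for element presence with a timeout would be more robust.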