I’m scraping sold-listing data from eBay using rvest.
Recently, eBay has started injecting hidden text into the readable text – see the image and scraped data.
Here is a URL example – you may or may not get the interlaced text:
Example URL
XPath to a line item
//*[@id="srp-river-results"]/ul/li[1]/div/div[2]/div[2]/div/span[1]
XPath I use to get all lines
//*[@id="srp-river-results"]/ul/li/div/div[2]
I need the text from s-a4v02P
and that text aggregated by line item.
I get something like this:
"So2ld Mar 8,D1 2021JUI" "KSold MQaUr2V 3,E 20R2KC1R"
and so on
The question is: how can I get just "Sold Mar 8, 2021" "Sold Mar 3, 2021"
and so on?
Code I have so far:
library(rvest)
# url holds the search-results URL (see the example URL above)
readHTML <- url %>%
  read_html()
Title <- readHTML %>%
  html_nodes(xpath='//*[@id="srp-river-results"]/ul/li/div/div[2]/a/h3') %>%
  html_text()
SoldDateTop <- readHTML %>%
  html_nodes(xpath='//*[@id="srp-river-results"]/ul/li/div/div[2]') %>%
  html_nodes("[class='s-item__title--tagblock ']") %>%
  html_nodes("[class='POSITIVE']") %>%
  # html_nodes("[class='s-8kjgi5']") %>% <-- this class name changes
  html_text()
2 Answers
The page contains a “style” element whose CSS rules mark each class as displayed or hidden, so you can use it to determine which tags are shown and which are hidden.
Using rvest version 1.0.0
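As a rough sketch of that idea (not the exact code for the live page): read the CSS from the style element, collect the class names whose rule sets display: none, and keep only the remaining spans for each line item. The HTML fragment, the class name s-zz9xk1, the form of the CSS rules, and the assumption that the obfuscating spans are direct children of the POSITIVE span are all made up for illustration; the real page may differ.

```r
library(rvest)

# Hypothetical stand-in for the eBay page; class names and CSS are assumptions.
page <- read_html('
<html><head><style>
span.s-zz9xk1 { display: none; }
span.s-a4v02P { display: inline; }
</style></head><body><ul>
<li><span class="POSITIVE"><span class="s-a4v02P">So</span><span class="s-zz9xk1">2</span><span class="s-a4v02P">ld Mar 8,</span><span class="s-zz9xk1">D1</span><span class="s-a4v02P"> 2021</span><span class="s-zz9xk1">JUI</span></span></li>
<li><span class="POSITIVE"><span class="s-zz9xk1">K</span><span class="s-a4v02P">Sold M</span><span class="s-zz9xk1">Q</span><span class="s-a4v02P">ar 3,</span><span class="s-zz9xk1">E</span><span class="s-a4v02P"> 2021</span><span class="s-zz9xk1">R</span></span></li>
</ul></body></html>')

# Collect the CSS text from every <style> element.
css <- paste(html_text(html_elements(page, "style")), collapse = "\n")

# Class names whose rule contains "display: none" (assumed rule format).
pattern <- "s-[A-Za-z0-9]+(?=\\s*\\{[^}]*display:\\s*none)"
hidden <- regmatches(css, gregexpr(pattern, css, perl = TRUE))[[1]]

# For each line item, keep the child spans that are not hidden and join their text.
items <- html_elements(page, ".POSITIVE")
sold <- vapply(items, function(item) {
  spans <- html_children(item)
  keep <- !(html_attr(spans, "class") %in% hidden)
  paste(html_text(spans[keep]), collapse = "")
}, character(1))

print(sold)
```

Working per line item keeps the text aggregated by item, which is what the question asks for.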
This is a similar approach, but based on the observation that the variable part of the class value has length 6 for visible classes, so you can extract the appropriate visible class value from the CSS style instructions.
It also means you can probably skip extracting the appropriate class and instead filter on the length of the class being 8, i.e.
html_nodes('.POSITIVE span') %>% html_attr('class') %>% nchar() == 8
I will have a look at that later today.
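A minimal sketch of the length-based filter, grouped by line item. It relies entirely on the observation above (visible classes are 8 characters long, hidden ones are not), which may stop holding if eBay changes its markup. The sample HTML and the 7-character hidden class s-zz9xk are hypothetical, and the spans are assumed to be direct children of the POSITIVE span.

```r
library(rvest)

# Hypothetical fragment: visible class is 8 characters, hidden class is 7.
page <- read_html('
<html><body><ul>
<li><span class="POSITIVE"><span class="s-a4v02P">So</span><span class="s-zz9xk">2</span><span class="s-a4v02P">ld Mar 8,</span><span class="s-zz9xk">D1</span><span class="s-a4v02P"> 2021</span><span class="s-zz9xk">JUI</span></span></li>
<li><span class="POSITIVE"><span class="s-zz9xk">K</span><span class="s-a4v02P">Sold M</span><span class="s-zz9xk">Q</span><span class="s-a4v02P">ar 3,</span><span class="s-zz9xk">E</span><span class="s-a4v02P"> 2021</span><span class="s-zz9xk">R</span></span></li>
</ul></body></html>')

items <- html_elements(page, ".POSITIVE")

sold <- vapply(items, function(item) {
  spans <- html_children(item)
  cls <- html_attr(spans, "class")
  # Keep spans whose class is exactly 8 characters; spans without a class are dropped.
  visible <- !is.na(cls) & nchar(cls) == 8
  paste(html_text(spans[visible]), collapse = "")
}, character(1))

print(sold)
```

Since nchar() is vectorised over a character vector, no map() call is needed for the length check.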