Ebay API - Rvest Split Data by Class Name where the class names change

Jacksonsox
March 11, 2021

I’m web scraping sold eBay listing data using rvest.

Recently, eBay has started injecting hidden text into the readable text – see the image and scraped data.

Here is a URL example – you may or may not get the interlaced text:
Example URL

XPath to a line item

//*[@id="srp-river-results"]/ul/li[1]/div/div[2]/div[2]/div/span[1]

XPath I use to get all lines

//*[@id="srp-river-results"]/ul/li/div/div[2]

I need only the text from the visible spans (class s-a4v02P in this example), concatenated per line item.

I get something like this:

"So2ld Mar 8,D1 2021JUI" "KSold MQaUr2V 3,E 20R2KC1R" and so on

The question is: how can I get just "Sold Mar 8, 2021" "Sold Mar 3, 2021" and so on?

Code I have so far:

library(rvest)

readHTML <- url %>%
  read_html()

Title <- readHTML %>%
  html_nodes(xpath = '//*[@id="srp-river-results"]/ul/li/div/div[2]/a/h3') %>%
  html_text()

SoldDateTop <- readHTML %>%
  html_nodes(xpath = '//*[@id="srp-river-results"]/ul/li/div/div[2]') %>%
  html_nodes("[class='s-item__title--tagblock ']") %>%
  html_nodes("[class='POSITIVE']") %>%
  # html_nodes("[class='s-8kjgi5']") %>% <-- this class name changes
  html_text()

Tags: r rvest web-scraping

Answers

The page contains a “style” element whose CSS rules say which classes are displayed and which are hidden, so you can use it to tell the visible tags from the hidden ones.

Using rvest version 1.0.0

library(rvest)
page <- read_html(url)

# find style tags
styles <- page %>% html_elements("style") %>% html_text2()

# get the "display: inline" key
# assuming it is always the first selector of the second style node
displayInline <- gsub("(.*?) \\{.*", "\\1", styles[2])

# find span nodes with both class and role specified
parent <- page %>% html_elements(xpath = ".//span[@class='POSITIVE' and @role='text']")

# retrieve the dates, joining the visible fragments of each listing
sapply(parent, function(p) {p %>% html_elements(displayInline) %>% html_text() %>% paste(collapse = "")})

[1] "Sold  Mar 8, 2021"  "Sold  Mar 3, 2021"  "Sold  Feb 27, 2021" "Sold  Feb 22, 2021" "Sold  Feb 20, 2021" "Sold  Feb 19, 2021" "Sold  Feb 5, 2021" 
[8] "Sold  Feb 4, 2021"  "Sold  Feb 3, 2021"  "Sold  Jan 31, 2021" "Sold  Jan 27, 2021" "Sold  Jan 22, 2021" "Sold  Jan 10, 2021" "Sold  Jan 3, 2021" 
[15] "Sold  Jan 1, 2021"  "Sold  Jan 1, 2021"  "Sold  Dec 30, 2020" "Sold  Dec 25, 2020" "Sold  Dec 22, 2020" "Sold  Dec 20, 2020" "Sold  Dec 11, 2020"
[22] "Sold  Mar 3, 2021"  "Sold  Jan 27, 2021" "Sold  Dec 25, 2020"
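
To try the idea without requesting a live eBay page, here is a minimal offline sketch. The HTML snippet and the class names s-abc123 (visible) and s-zz9 (hidden) are invented for illustration, and the sketch reads the first style element with html_text() rather than the second one with html_text2(), so treat it as an approximation of the approach above, not a copy of eBay's markup.

```r
library(rvest)

# Hypothetical stand-in for a results page; class names are invented.
html <- '<html><head>
<style>span.s-abc123 {display: inline;} span.s-zz9 {display: none;}</style>
</head><body>
<span class="POSITIVE" role="text"><span class="s-abc123">So</span><span class="s-zz9">2</span><span class="s-abc123">ld Mar 8,</span><span class="s-zz9">D1</span><span class="s-abc123"> 2021</span><span class="s-zz9">JUI</span></span>
</body></html>'

page <- read_html(html)

# Text of the style element; the first selector is assumed to be the visible one.
style_text <- page %>% html_element("style") %>% html_text()
visible_selector <- gsub("(.*?) \\{.*", "\\1", style_text)

# One parent span per listing.
parents <- page %>%
  html_elements(xpath = ".//span[@class='POSITIVE' and @role='text']")

# Keep only the visible fragments of each listing and glue them together.
sold <- vapply(parents, function(p) {
  p %>% html_elements(visible_selector) %>% html_text() %>% paste(collapse = "")
}, character(1))

print(sold)
# [1] "Sold Mar 8, 2021"
```

If the order of the CSS rules changes, the first selector may no longer be the visible one, so it is worth checking style_text by eye before relying on it.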

A similar approach, based on the observation that the variable part of the class value has length 6 for visible classes, so you can extract the visible class value from the CSS style rules:

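library(rvest)
library(magrittr)
library(stringr)
library(purrr)  # provides map(), used below

# Concatenate the text of the visible spans within one listing
get_sold_date <- function(nodelist, visible_class) {
  nodelist %>%
    html_nodes(paste0('.POSITIVE span.', visible_class)) %>%
    html_text() %>%
    paste(collapse = '')
}

# Pull the first class of the form "s-" plus six characters out of the CSS text
get_visible_class <- function(node) {
  stringr::str_extract(node, '(s-[a-z0-9]{6})')
}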

page <- read_html('https://www.ebay.com/sch/i.html?_nkw=Star%20Wars%20Black%20Series%20%20-POTF%20-POTF2%20-POTFII%20-Vintage%20%20Boba%20Fett%20Han%20Solo%20(SDCC,San%20Diego,Con)%20Carbonite%20-Walgreens%20-3.75%20-3/4%20-Connexions%20-Die%20-Lot%20-Topps%20-Sideshow%20-1/6%20-1/12%20-AFA%20-UKG%20-Custom%20-Signature%20-Lego%20-Funko%20-Pop&LH_Sold=1&LH_ItemCondition=3&_dmd=7&_ipg=200&LH_Complete=1&LH_PrefLoc=1')
listings <- page %>% 
  html_nodes('#srp-river-results .s-item')

visible_class <- get_visible_class(
  page %>%
    html_node('style[type="text/css"]') %>%
    html_text(trim = TRUE)
)

dates <- map(listings,  get_sold_date,  visible_class)

print(dates)
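
As a quick illustration of what the class-extraction pattern does, here is a small base R sketch run on a made-up CSS string. The class names s-x1y2z3 and s-q7 are hypothetical; the real style text on eBay may be laid out differently.

```r
# Hypothetical CSS text; both class names are invented for this sketch.
css <- "span.s-x1y2z3 {display: inline;} span.s-q7 {display: none;}"

# Same pattern as get_visible_class(): "s-" followed by exactly six
# lowercase letters or digits. regexpr() returns only the first match.
m <- regexpr("s-[a-z0-9]{6}", css)
visible_class <- regmatches(css, m)

print(visible_class)
# [1] "s-x1y2z3"
```

Note that the pattern is not anchored at the end, so a longer class name would also match on its first six characters. If a hidden class that long ever appears before the visible one, the pattern would need tightening.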

This also means you can probably skip extracting the visible class and instead filter on the class name having length 8, i.e. html_nodes('.POSITIVE span') %>% html_attr('class') %>% map(nchar) == 8. I will have a look at that later today.
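
A rough sketch of that length-based filter, under the assumption that visible class names are always 8 characters long ("s-" plus six) and hidden ones are not. The HTML and class names are invented, and the sketch handles a single listing; for a full page you would apply it to each listing separately.

```r
library(rvest)

# Hypothetical single listing; class names are invented.
html <- '<html><body>
<span class="POSITIVE"><span class="s-abc123">So</span><span class="s-zz9">2</span><span class="s-abc123">ld Mar 8, 2021</span><span class="s-q7">JUI</span></span>
</body></html>'

spans <- read_html(html) %>% html_elements(".POSITIVE span")
classes <- html_attr(spans, "class")

# Keep spans whose class name is exactly 8 characters long.
keep <- !is.na(classes) & nchar(classes) == 8
sold <- paste(html_text(spans[keep]), collapse = "")

print(sold)
# [1] "Sold Mar 8, 2021"
```

This avoids reading the style element at all, at the cost of depending entirely on the length observation continuing to hold.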
