I’m scraping sold eBay listing data using rvest.

Recently, eBay has started injecting hidden text into the readable text – see the image and scraped data.

Here is a URL example – you may or may not get the interlaced text:
Example URL

XPath to a line item


XPath I use to get all lines


I need the text from s-a4v02P and that text aggregated by line item.

I get something like this:

"So2ld Mar 8,D1 2021JUI" "KSold MQaUr2V 3,E 20R2KC1R" and so on

The question is: how can I get just "Sold Mar 8, 2021" "Sold Mar 3, 2021" and so on?

Code I have so far:

library(rvest)

readHTML <- url %>%
    read_html()

# Listing titles
Title <- readHTML %>%
    html_nodes(xpath='//*[@id="srp-river-results"]/ul/li/div/div[2]/a/h3') %>%
    html_text()

# Sold dates (these come back interlaced with the hidden text)
SoldDateTop <- readHTML %>%
    html_nodes(xpath='//*[@id="srp-river-results"]/ul/li/div/div[2]') %>%
    html_nodes("[class='s-item__title--tagblock ']") %>%
    html_nodes("[class='POSITIVE']") %>%
    # html_nodes("[class='s-8kjgi5']") %>% <-- this class name changes
    html_text()

Example of the "hidden" text



  1. To determine which tags are shown and which are hidden, look at the “style” element on the page: it contains the CSS rules that mark each class as displayed or hidden.

    Using rvest version 1.0.0

    library(rvest)

    page <- read_html(url)
    # find style tags
    styles <- page %>% html_elements("style") %>% html_text2()
    # get the "display inline" key
    # Assuming it is always the first style element of the second style node
    displayInline <- gsub("(.*?) \\{.*", "\\1", styles[2])
    # find span nodes with both class and role specified
    parent <- page %>% html_elements(xpath=".//span[@class='POSITIVE' and @role='text']")
    # retrieve the dates, one string per line item
    sapply(parent, function(p) {p %>% html_elements(displayInline) %>% html_text() %>% paste(collapse = "")})
    [1] "Sold  Mar 8, 2021"  "Sold  Mar 3, 2021"  "Sold  Feb 27, 2021" "Sold  Feb 22, 2021" "Sold  Feb 20, 2021" "Sold  Feb 19, 2021" "Sold  Feb 5, 2021" 
    [8] "Sold  Feb 4, 2021"  "Sold  Feb 3, 2021"  "Sold  Jan 31, 2021" "Sold  Jan 27, 2021" "Sold  Jan 22, 2021" "Sold  Jan 10, 2021" "Sold  Jan 3, 2021" 
    [15] "Sold  Jan 1, 2021"  "Sold  Jan 1, 2021"  "Sold  Dec 30, 2020" "Sold  Dec 25, 2020" "Sold  Dec 22, 2020" "Sold  Dec 20, 2020" "Sold  Dec 11, 2020"
    [22] "Sold  Mar 3, 2021"  "Sold  Jan 27, 2021" "Sold  Dec 25, 2020"
  2. A similar approach, but based on the observation that the variable part of the class value has length 6 for visible classes, so you can extract the visible class value from the CSS style rules.

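    library(rvest)
    library(purrr)

    # Concatenate the text of the visible spans within one listing
    get_sold_date <- function(nodelist, visible_class){
      nodelist %>% 
        html_nodes(paste0('.POSITIVE span.', visible_class))  %>% 
        html_text() %>% 
        paste(collapse = '')
    }

    # Extract the first class name of the form s-xxxxxx from the CSS text
    get_visible_class <- function(node){
      stringr::str_extract(node, '(s-[a-z0-9]{6})')
    }
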
    page <- read_html(',San%20Diego,Con)%20Carbonite%20-Walgreens%20-3.75%20-3/4%20-Connexions%20-Die%20-Lot%20-Topps%20-Sideshow%20-1/6%20-1/12%20-AFA%20-UKG%20-Custom%20-Signature%20-Lego%20-Funko%20-Pop&LH_Sold=1&LH_ItemCondition=3&_dmd=7&_ipg=200&LH_Complete=1&LH_PrefLoc=1')
    listings <- page %>% 
      html_nodes('#srp-river-results .s-item')
    visible_class <- get_visible_class(page %>% 
                                         html_node('style[type="text/css"]') %>% 
                                         html_text(trim = TRUE))
    dates <- map(listings,  get_sold_date,  visible_class)

    This also means you can probably skip extracting the visible class altogether and instead filter on the class attribute having length 8, i.e. html_nodes('.POSITIVE span') %>% html_attr('class') %>% map(nchar) == 8. I will have a look at that later today.
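
The length-based filter suggested above could look roughly like the following. This is only a sketch under stated assumptions and has not been run against the live eBay page: the inline HTML, the class names s-abc123 and s-abc1234, and the helper keep_visible are made up for illustration, and it assumes that visible spans always carry a class of exactly 8 characters while hidden ones do not.

```r
library(rvest)  # rvest >= 1.0.0; provides read_html(), html_elements(), %>%

# Hypothetical, simplified markup standing in for two eBay line items.
# Spans with the 8-character class hold visible text; the others hold decoy text.
html <- paste0(
  '<html><body>',
  '<span class="POSITIVE" role="text">',
  '<span class="s-abc123">Sold </span><span class="s-abc1234">X7</span>',
  '<span class="s-abc123">Mar 8, </span><span class="s-abc1234">Q</span>',
  '<span class="s-abc123">2021</span>',
  '</span>',
  '<span class="POSITIVE" role="text">',
  '<span class="s-abc1234">K</span><span class="s-abc123">Sold Mar 3, 2021</span>',
  '</span>',
  '</body></html>'
)

page <- read_html(html)

# One parent node per line item
parents <- page %>% html_elements(xpath = ".//span[@class='POSITIVE' and @role='text']")

# Hypothetical helper: keep only descendant spans whose class attribute has
# the assumed length, then glue their text together for this line item.
keep_visible <- function(parent, class_length = 8) {
  spans   <- html_elements(parent, xpath = ".//span")
  classes <- html_attr(spans, "class")
  visible <- !is.na(classes) & nchar(classes) == class_length
  paste(html_text(spans[visible]), collapse = "")
}

dates <- vapply(parents, keep_visible, character(1))
print(dates)
```

Filtering per parent node rather than across the whole page keeps the text aggregated by line item, which is what the question asks for. If eBay changes the length of the generated class names, the class_length assumption breaks, so reading the visible class from the style element is likely the more robust option.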

