
I’m scraping eBay sold-listings data using rvest.

Recently, eBay has started injecting hidden text into the readable text – see the image and scraped data.

Here is a URL example – you may or may not get the interlaced text:
Example URL

XPath to a line item

//*[@id="srp-river-results"]/ul/li[1]/div/div[2]/div[2]/div/span[1]

XPath I use to get all lines

//*[@id="srp-river-results"]/ul/li/div/div[2]

I need the text from the spans with class s-a4v02P, concatenated per line item.

Instead, I get something like this:

"So2ld Mar 8,D1 2021JUI" "KSold MQaUr2V 3,E 20R2KC1R" and so on

The question is: how can I get just "Sold Mar 8, 2021" "Sold Mar 3, 2021" and so on?

The code I have so far:

library(rvest)

# `url` is assumed to hold the search-results URL
readHTML <- url %>%
    read_html()

Title <- readHTML %>%
    html_nodes(xpath='//*[@id="srp-river-results"]/ul/li/div/div[2]/a/h3') %>%
    html_text()

SoldDateTop <- readHTML %>%
    html_nodes(xpath='//*[@id="srp-river-results"]/ul/li/div/div[2]') %>%
    html_nodes("[class='s-item__title--tagblock ']") %>%
    html_nodes("[class='POSITIVE']") %>%
    # html_nodes("[class='s-8kjgi5']") %>% <-- this class name changes
    html_text()

Example of the "hidden" text

2 Answers


  1. To determine which tags are shown and which are hidden, look at the “style” element on the page: it holds the CSS rules that mark each class as displayed or hidden.

    Using rvest version 1.0.0

    library(rvest)
    page <- read_html(url)
    
    # find style tags
    styles <- page %>% html_elements("style") %>% html_text2()
    
    # get the "display inline" key
    # Assuming it is always the selector of the first rule in the second style node
    displayInline <- gsub("(.*?) \\{.*", "\\1", styles[2])
    
    # find span nodes with both class and role specified
    parent <- page %>% html_elements(xpath=".//span[@class='POSITIVE' and @role='text']")
    
    # retrieve the dates by concatenating only the visible child spans
    sapply(parent, function(p) {p %>% html_elements(displayInline) %>% html_text() %>% paste(collapse = "")})
    
    [1] "Sold  Mar 8, 2021"  "Sold  Mar 3, 2021"  "Sold  Feb 27, 2021" "Sold  Feb 22, 2021" "Sold  Feb 20, 2021" "Sold  Feb 19, 2021" "Sold  Feb 5, 2021" 
    [8] "Sold  Feb 4, 2021"  "Sold  Feb 3, 2021"  "Sold  Jan 31, 2021" "Sold  Jan 27, 2021" "Sold  Jan 22, 2021" "Sold  Jan 10, 2021" "Sold  Jan 3, 2021" 
    [15] "Sold  Jan 1, 2021"  "Sold  Jan 1, 2021"  "Sold  Dec 30, 2020" "Sold  Dec 25, 2020" "Sold  Dec 22, 2020" "Sold  Dec 20, 2020" "Sold  Dec 11, 2020"
    [22] "Sold  Mar 3, 2021"  "Sold  Jan 27, 2021" "Sold  Dec 25, 2020"
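
    The idea can be tried offline on a small fixture. The sketch below is a variant, not the exact method above: instead of assuming the visible rule's position, it looks for the rule whose body contains `display: inline`. The markup, the class names `s-ab12cd` and `s-zz9`, and the variable names are invented for illustration and are not taken from eBay's page.

    ```r
    library(rvest)

    # Hypothetical markup: class names and style rules are made up for this sketch
    html <- paste0(
      '<html><head><style>',
      'span.s-ab12cd { display: inline; } span.s-zz9 { display: none; }',
      '</style></head><body>',
      '<span class="POSITIVE" role="text">',
      '<span class="s-ab12cd">So</span><span class="s-zz9">2</span>',
      '<span class="s-ab12cd">ld Mar 8,</span><span class="s-zz9">D1</span>',
      '<span class="s-ab12cd"> 2021</span><span class="s-zz9">JUI</span>',
      '</span></body></html>'
    )
    page <- read_html(html)

    # Text of the style element that declares which class is displayed
    style_text <- page %>% html_element("style") %>% html_text()

    # Selector of the first rule whose body contains "display: inline"
    visible_selector <- regmatches(
      style_text,
      regexpr("[^{}\\s][^{}]*?(?=\\s*\\{[^}]*display:\\s*inline)",
              style_text, perl = TRUE)
    )

    # One parent span per listing
    parents <- page %>%
      html_elements(xpath = ".//span[@class='POSITIVE' and @role='text']")

    # Concatenate only the child spans matched by the visible selector
    dates <- vapply(parents, function(p) {
      p %>% html_elements(visible_selector) %>% html_text() %>% paste(collapse = "")
    }, character(1))

    print(dates)
    ```

    If the real page declares visibility differently (for example through a different property, or in more than one rule), the regular expression would need adjusting.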
    
  2. A similar approach, based on the observation that the variable part of the class value is 6 characters long for visible classes, so you can extract the visible class value from the CSS style rules.

    library(rvest)
    library(magrittr)
    library(stringr)
    library(purrr)  # provides map(), used below
    
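    # Concatenate the text of the visible spans inside one listing
    get_sold_date <- function(nodelist, visible_class){
      nodelist %>% 
        html_nodes(paste0('.POSITIVE span.', visible_class)) %>% 
        html_text() %>% 
        paste(collapse = '')
    }
    
    # Pull the first class name of the form s-xxxxxx out of the style text
    get_visible_class <- function(node){
      stringr::str_extract(node, '(s-[a-z0-9]{6})')
    }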
    
    page <- read_html('https://www.ebay.com/sch/i.html?_nkw=Star%20Wars%20Black%20Series%20%20-POTF%20-POTF2%20-POTFII%20-Vintage%20%20Boba%20Fett%20Han%20Solo%20(SDCC,San%20Diego,Con)%20Carbonite%20-Walgreens%20-3.75%20-3/4%20-Connexions%20-Die%20-Lot%20-Topps%20-Sideshow%20-1/6%20-1/12%20-AFA%20-UKG%20-Custom%20-Signature%20-Lego%20-Funko%20-Pop&LH_Sold=1&LH_ItemCondition=3&_dmd=7&_ipg=200&LH_Complete=1&LH_PrefLoc=1')
    listings <- page %>% 
      html_nodes('#srp-river-results .s-item')
    
    visible_class <- get_visible_class(page %>% 
                                         html_node('style[type="text/css"]') %>% 
                                         html_text(trim = TRUE))
    
    dates <- map(listings,  get_sold_date,  visible_class)
    
    print(dates)
    

    This also means you can probably skip extracting the class name and instead filter on the class length being 8, e.g. html_nodes('.POSITIVE span') %>% html_attr('class') %>% map(nchar) == 8. I will have a look at that later today.
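
    A minimal sketch of that class-length filter, run on an invented fixture rather than the live page. The markup, the class names, and the helper `visible_text` are hypothetical, and the rule that visible classes look like `s-` plus 6 characters is only the observation stated above, not something eBay guarantees.

    ```r
    library(rvest)

    # Hypothetical markup with two listings; class names are made up
    html <- paste0(
      '<html><body>',
      '<span class="POSITIVE"><span class="s-ab12cd">Sold Mar</span>',
      '<span class="s-zz9">Q</span><span class="s-ab12cd"> 8, 2021</span></span>',
      '<span class="POSITIVE"><span class="s-q7">K</span>',
      '<span class="s-ab12cd">Sold Mar 3, 2021</span></span>',
      '</body></html>'
    )
    page <- read_html(html)

    # Keep child spans whose class starts with "s-" and has the expected length.
    # The "s-" check avoids matching unrelated 8-character classes such as "POSITIVE".
    visible_text <- function(parent, class_length = 8) {
      spans <- html_elements(parent, "span")
      classes <- html_attr(spans, "class")
      keep <- grepl("^s-", classes) & !is.na(classes) & nchar(classes) == class_length
      paste(html_text(spans[keep]), collapse = "")
    }

    # One result per listing
    parents <- html_elements(page, ".POSITIVE")
    dates <- vapply(parents, visible_text, character(1))

    print(dates)
    ```

    This avoids parsing the style element entirely, at the cost of breaking silently if the length convention changes.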
