
"Error: parse error: trailing garbage" : How do I get content of <script type="application/ld+json"> using R

chappe29
July 10, 2023
333 views
0 votes
2 Answers

I’m trying to extract some data for a small research project in real estate economics and would like to get the price, the lot area, the description, the location, etc., out of <script type="application/ld+json">. I know I can use CSS selectors on this site. However, Hemnet aggregates offers from several websites, and the CSS selectors differ depending on the site the offer comes from. I would therefore like to know how to get data from <script type="application/ld+json">, because it might also be useful later.

I have already tried this:

library(rvest)
library(xml2)
library(jsonlite)
library(dplyr)

o_url <- "https://www.hemnet.se/bostad/tomt-lisselbo-falu-kommun-svartskar-1-17-14704536"
o_html <- read_html(o_url)

o_json <- html_nodes(o_html, '[type="application/ld+json"]') %>% html_text()
ldjson <- jsonlite::fromJSON(o_json)

However, I’m getting this error message:

Error: parse error: trailing garbage
          https://www.hemnet.se"   }   {     "@context": "http://schem
                     (right here) ------^

I think it might be a format problem but I don’t really know much about json. I’ve only started using APIs recently. Indeed, ‘o_json’ looks like this:

[1] ...
[2] "\n  {\n    \"@context\": \"http://schema.org/\",\n    \"@type\": \"Product\",\n    \"name\": \"Svartskär 1:17\",\n    \"image\": \"https://bilder.hemnet.se/images/itemgallery_L/40/dd/40dde2279382da793a435e1d71879ebb.jpg\",\n    \"description\":  \"Nu finns möjligheten att förvärva en tomt med sjöläge i Lisselbo! Markarbeten är utförda och kommunalt avlopp är betalt. Kommunalt vatten kostar ca 30000 kronor att koppla in och finns vid tomtgränsen. Varmt välkomna att besöka tomten. Vid frågor kontakta oss.\",\n      \"offers\": {\n        \"@type\": \"Offer\",\n        \"priceCurrency\": \"SEK\",\n        \"price\": 950000,\n        \"priceValidUntil\": \"2020-09-14T13:32:20+0200\",\n        \"availability\": \"http://schema.org/InStock\",\n        \"validFrom\": \"2018-09-14T13:32:20+0200\",\n        \"url\": \"https://www.hemnet.se/bostad/tomt-lisselbo-falu-kommun-svartskar-1-17-14704536\"\n      },\n    \"mpn\": \"14704536\",\n      \"brand\": \"SkandiaMäklarna Falun\"\n  }\n"
[3] ...

I thought there was a problem with "\n" and tried this:

library(stringr)
clean_json <- str_remove_all(o_json, "\n")
ldjson <- jsonlite::fromJSON(clean_json)

However, I still get the same error message…

Thank you in advance for your help! If you also have any advice on my code (not only about the problem I’ve been facing), I would be happy to hear it!

Answers

Chosen as BEST ANSWER
- chappe29
- July 10, 2023 at 8:09 pm
- 0 votes
Ok, my bad. I think I've found a(nother) way to do it. First, you need to:

install.packages("jsonld")

For instance, if you want to get the price, use:
```
library(jsonld)
# expand the second element (the Product block) to full schema.org IRIs
expanded <- jsonld_expand(o_json[2])
expa <- jsonlite::fromJSON(expanded)
o_price <- expa$`http://schema.org/offers`[[1]]$`http://schema.org/price`[[1]][1, 1]
```
We now get 950000 as numeric! If you find a better way to do it, please tell me.

(Edit)

As noted in the comments, the first element returned by the selector is not valid JSON.

cat(o_json[1])

returns:

  {
    "@context": "http://schema.org",
    "@type": "WebSite",
    "name": "Hemnet",
    "url": "https://www.hemnet.se"
  }
  {
    "@context": "http://schema.org",
    "@type": "Organization",
    "url": "https://www.hemnet.se",
    "logo": "https://assets.hemnet.se/assets/images/hemnet-logo.svg"
  }

It apparently passes structured data / JSON-LD validation (https://validator.schema.org/), but the validator simply ignores the 2nd block. The element holds two JSON objects back to back, which is why a strict JSON parser reports trailing garbage.

You also probably should not call jsonlite::fromJSON(o_json), that is, fromJSON() on a vector of multiple JSON strings. It is not vectorized and, somewhat surprisingly, it neither complains nor uses just the first value; instead it seems to collapse the argument vector into a single string, which then fails with the same error.
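If the contents of that first element are still needed, one possible workaround is to turn the concatenated objects into a JSON array before parsing. The sketch below uses a shortened, hypothetical stand-in for o_json[1] and a naive regular expression that assumes no string value contains a "}" followed by a "{". It is an illustration under those assumptions, not a general JSON repair tool.

```r
library(jsonlite)

# Shortened, hypothetical stand-in for o_json[1]: two objects back to back
raw <- '{
  "@type": "WebSite",
  "name": "Hemnet"
}
{
  "@type": "Organization",
  "url": "https://www.hemnet.se"
}'

# validate() confirms the text is not a single valid JSON document
is_valid <- isTRUE(jsonlite::validate(raw))

# Naive repair: put a comma between adjacent objects and wrap them in [ ]
# (assumes "}" followed by "{" never occurs inside a string value)
as_array <- paste0("[", gsub("\\}\\s*\\{", "},{", raw, perl = TRUE), "]")
parsed <- fromJSON(as_array, simplifyVector = FALSE)

# Name each object after its "@type" value for easier access
names(parsed) <- vapply(parsed, function(x) x[["@type"]], character(1))
parsed$Organization$url
```

With simplifyVector = FALSE the result stays a plain nested list, so each object can be handled separately.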
Simplified example:

o_json <- c('{"a" : 1}', '{"b" : 2}')
jsonlite::fromJSON(o_json)
#> Error: parse error: trailing garbage
#>                              {"a" : 1} {"b" : 2}
#>                      (right here) ------^
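
One way around this is to parse each string on its own, for example with lapply(), and to wrap the call in tryCatch() so that one malformed element does not stop the rest. A minimal sketch with made-up input (the helper name parse_or_null is hypothetical):

```r
library(jsonlite)

# Made-up input: two valid JSON strings and one with two objects back to back
json_strings <- c('{"a" : 1}', '{"b" : 2} {"c" : 3}', '{"d" : 4}')

# Hypothetical helper: parse one string, return NULL if it is not valid JSON
parse_or_null <- function(txt) {
  tryCatch(jsonlite::parse_json(txt), error = function(e) NULL)
}

parsed <- lapply(json_strings, parse_or_null)

# TRUE where parsing failed; here only the second string is affected
failed <- vapply(parsed, is.null, logical(1))
parsed[!failed]
```

Keeping the failed positions in a logical vector makes it easy to inspect the offending strings afterwards with json_strings[failed].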

Extracting data from json-ld elements might look something like this:

library(rvest)
library(dplyr)
library(purrr)
library(tidyr)
library(jsonlite)

o_url <- "https://www.hemnet.se/bostad/tomt-lisselbo-falu-kommun-svartskar-1-17-14704536"
o_html <- read_html(o_url)

o_json <- html_elements(o_html, '[type="application/ld+json"]') %>% html_text()

# parse all but 1st JSON
p_json <- map(o_json[-1], parse_json)
# extract "@type" values to use as names for the list:
p_json <- set_names(p_json, map(p_json, "@type"))

# extract few random values from list:
p_json$Product$description
#> [1] "Nu finns möjligheten att förvärva en tomt med sjöläge i Lisselbo! Markarbeten är utförda och kommunalt avlopp är betalt. Kommunalt vatten kostar ca 30000 kronor att koppla in och finns vid tomtgränsen. Varmt välkomna att besöka tomten. Vid frågor kontakta oss."
p_json$Product$offers$price
#> [1] 950000

# turn into wide data frame (single row, all fields as columns),
# pivot to longer for better overview
p_json %>% 
  as.data.frame() %>%
  select(!matches(".type$|.context$")) %>% 
  mutate(across(everything(), as.character)) %>% 
  pivot_longer(everything())
#> # A tibble: 16 × 2
#>    name                           value                                         
#>    <chr>                          <chr>                                         
#>  1 Product.name                   Svartskär 1:17                                
#>  2 Product.image                  https://bilder.hemnet.se/images/itemgallery_L…
#>  3 Product.description            Nu finns möjligheten att förvärva en tomt med…
#>  4 Product.offers.priceCurrency   SEK                                           
#>  5 Product.offers.price           950000                                        
#>  6 Product.offers.priceValidUntil 2020-09-14T13:32:20+0200                      
#>  7 Product.offers.availability    http://schema.org/InStock                     
#>  8 Product.offers.validFrom       2018-09-14T13:32:20+0200                      
#>  9 Product.offers.url             https://www.hemnet.se/bostad/tomt-lisselbo-fa…
#> 10 Product.mpn                    14704536                                      
#> 11 Product.brand                  SkandiaMäklarna Falun                         
#> 12 Place.address.streetAddress    Svartskär 1:17                                
#> 13 Place.address.addressLocality  Lisselbo, Falu kommun                         
#> 14 Place.address.addressRegion    Dalarnas län                                  
#> 15 Place.address.addressCountry   SE                                            
#> 16 Place.address.postalCode       79196

Created on 2023-07-10 with reprex v2.0.2
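
To try the flattening step without requesting the page, the same idea can be sketched on inline JSON with jsonlite and base R only. The two strings below are abbreviated, hypothetical stand-ins for the Product and Place elements. unlist() yields similar dotted names, though every value is coerced to character.

```r
library(jsonlite)

# Abbreviated, hypothetical stand-ins for two JSON-LD elements
json_strings <- c(
  '{"@type": "Product", "name": "Example lot", "offers": {"@type": "Offer", "priceCurrency": "SEK", "price": 950000}}',
  '{"@type": "Place", "address": {"addressCountry": "SE", "postalCode": "79196"}}'
)

p_json <- lapply(json_strings, parse_json)
names(p_json) <- vapply(p_json, function(x) x[["@type"]], character(1))

# unlist() flattens the nested list into a named character vector
flat <- unlist(p_json)

# Drop the "@type" entries and show the rest as a two-column data frame
flat <- flat[!grepl("@type$", names(flat))]
long <- data.frame(name = names(flat), value = unname(flat),
                   stringsAsFactors = FALSE)
long
```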

The number of JSON-LD elements does not seem to be limited to 3; some pages also include structured-data entries for events, for example.
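
Because the number and order of elements can vary between pages, it may be safer to pick elements by their "@type" value than by position. A minimal sketch on made-up input, assuming every element is a single JSON object (the helper name get_type and the example values are hypothetical):

```r
library(jsonlite)

# Made-up input standing in for the elements of a page that also has an event
json_strings <- c(
  '{"@type": "Product", "name": "Example lot"}',
  '{"@type": "Event", "name": "Example viewing"}',
  '{"@type": "Place", "address": {"postalCode": "79196"}}'
)
parsed <- lapply(json_strings, parse_json)

# Hypothetical helper: return the "@type" of one element, or NA if absent
get_type <- function(x) {
  type <- x[["@type"]]
  if (is.null(type)) NA_character_ else as.character(type)[1]
}
types <- vapply(parsed, get_type, character(1))

# Keep only the element(s) of interest, wherever they appear in the page
products <- parsed[!is.na(types) & types == "Product"]
products[[1]]$name
```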
