I have data in a JSON file that I want to unnest into a data.table. The JSON is structured like this:
[
  {
    "version_id": "123456",
    "data": [
      {
        "review_id": "1",
        "rating": 5,
        "review": "This app is great",
        "date": "2024-09-01"
      },
      {
        "review_id": "2",
        "rating": 1,
        "review": "This app is terrible",
        "date": "2024-09-01"
      }
    ]
  },
  {
    "version_id": "789101",
    "data": [
      {
        "review_id": "3",
        "rating": 3,
        "review": "This app is OK",
        "date": "2024-09-01"
      }
    ]
  }
]
I can wrangle this into a data.table by doing a bunch of processing and then binding the results together, but I want to know if it’s possible to do this more simply and/or efficiently with the j argument in data.table.
What I’m doing currently:
reviews <- jsonlite::read_json("reviews.json")
version_ids <- purrr::map_chr(reviews, "version_id")
review_data <- purrr::map(reviews, "data")
cbind(
  data.table::data.table(
    version_id = rep(version_ids, lengths(review_data))
  ),
  lapply(
    review_data,
    function(d) {
      data.table::data.table(
        review_id = purrr::map_chr(d, "review_id"),
        rating = purrr::map_int(d, "rating"),
        review = purrr::map_chr(d, "review"),
        date = purrr::map_chr(d, "date")
      )
    }
  ) |>
    data.table::rbindlist()
)
#> version_id review_id rating review date
#> 1: 123456 1 5 This app is great 2024-09-01
#> 2: 123456 2 1 This app is terrible 2024-09-01
#> 3: 789101 3 3 This app is OK 2024-09-01
This is the right result, but what I’m hoping to do is something like:
data.table::data.table(
  version_id = version_ids,
  review_data = review_data
)[
  rep(version_id, lengths(review_data)),
  .(
    review_id = purrr::map_chr(.SD["review_data"], "review_id"),
    rating = purrr::map_int(.SD["review_data"], "rating"),
    review = purrr::map_chr(.SD["review_data"], "review"),
    date = purrr::map_chr(.SD["review_data"], "date")
  ),
  by = version_id
]
But that gives me the error:
Error in `[.data.table`(data.table::data.table(version_id = version_ids, :
When i is a data.table (or character vector), the columns to join by must be specified using 'on=' argument (see ?data.table), by keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing column names between x and i (i.e., a natural join). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.
I get the same error if I omit the i entirely. Can anyone help with getting this to work?
2 Answers
EDIT, using data.table: Rather than trying to get the data.table to recurse through itself, set the data.table up to be long initially; then you can use the rbindlist() function to explode the columns:
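A minimal sketch of that approach, assuming the version_ids and review_data objects already built in the question (the long table and its review_data list-column name are just illustrative):

# One row per review: version_id repeated, each review kept as a named list
long <- data.table::data.table(
  version_id = rep(version_ids, lengths(review_data)),
  review_data = unlist(review_data, recursive = FALSE)
)

# rbindlist() in j explodes each named list into review_id/rating/review/date columns
long[, data.table::rbindlist(review_data), by = version_id]

Under those assumptions this should reproduce the three-row table shown in the question.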
Original answer, using tidyr, in case it’s helpful to others: If one were using tidyr rather than data.table, the following should work by unnesting the list in stages: first obtain the version_id index using unnest_wider(), then create a row for each review (using unnest()), and then create columns for the characteristics of each review (using unnest_wider()):
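A sketch of that staged pipeline, assuming reviews has been read with jsonlite::read_json() as in the question (the json column name is only illustrative):

tibble::tibble(json = reviews) |>
  # stage 1: spread version_id and the nested data list into columns
  tidyr::unnest_wider(json) |>
  # stage 2: one row per review; data is still a list of named lists
  tidyr::unnest(data) |>
  # stage 3: spread each review's fields into their own columns
  tidyr::unnest_wider(data)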
The tibble could be converted to a data.table if necessary.
Quick and dirty data.table solution:
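One possibility along those lines, looping over the versions and binding as it goes (a sketch, assuming the reviews object from the question):

data.table::rbindlist(
  lapply(reviews, function(v) {
    # bind this version's reviews into rows, then prepend its version_id
    cbind(
      version_id = v$version_id,
      data.table::rbindlist(v$data)
    )
  })
)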