Hey Stack Overflow legends,
I'm trying to extract a dataframe from a nested list within a nested list, and it feels like I'm stuck in some kind of Inception-like dream within a dream. The lists just keep nesting inside each other; it's like a never-ending Russian doll situation.
I’m using the following Trademe API call (https://api.trademe.co.nz/v1/Categories/0153-.xml) and the data is structured somewhat like this:
{
  "Category": {
    "Name": "Fashion",
    "Subcategories": [
      {
        "Name": "Women's Clothing",
        "Number": "0153-0154-",
        "Path": "Fashion/Women's Clothing",
        "Subcategories": [
          {
            "Name": "Dresses",
            "Number": "0153-0154-0155-",
            "Path": "Fashion/Women's Clothing/Dresses",
            "Subcategories": []
          },
          {
            "Name": "Tops",
            "Number": "0153-0154-0156-",
            "Path": "Fashion/Women's Clothing/Tops",
            "Subcategories": []
          },
          ...
        ]
      },
      {
        "Name": "Men's Clothing",
        "Number": "0153-0157-",
        "Path": "Fashion/Men's Clothing",
        "Subcategories": [
          {
            "Name": "Shirts",
            "Number": "0153-0157-0158-",
            "Path": "Fashion/Men's Clothing/Shirts",
            "Subcategories": []
          },
          {
            "Name": "Pants",
            "Number": "0153-0157-0159-",
            "Path": "Fashion/Men's Clothing/Pants",
            "Subcategories": []
          },
          ...
        ]
      },
      ...
    ]
  }
}
I want to extract a dataframe with three columns: "Name", "Number", and "Path". The "Name" column should contain the names of all the subcategories, the "Number" column should contain their corresponding numbers, and the "Path" column should contain their full paths. I really only need the data from the lowest level.
I’ve tried using lapply() and sapply(), but I just keep getting lost in the nested lists. Any help would be greatly appreciated!
Thanks in advance,
[Your Name]
3 Answers
In the provided example we could do:
Output:
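For the two-level example in the question, one possible way to do this (a minimal sketch, not necessarily the code originally posted) is to pull the second-level nodes out with lapply() and stack them with do.call(rbind, ...). The `node()` helper and the hand-built `cats` list are hypothetical stand-ins for the parsed response; with the real data the list would presumably come from something like jsonlite::fromJSON(json_string, simplifyVector = FALSE).

```r
# Hypothetical helper that builds one category node (illustration only).
node <- function(name, number, path, subs = list()) {
  list(Name = name, Number = number, Path = path, Subcategories = subs)
}

# Hand-built stand-in for the parsed API response shown in the question.
cats <- list(Category = list(
  Name = "Fashion",
  Subcategories = list(
    node("Women's Clothing", "0153-0154-", "Fashion/Women's Clothing", list(
      node("Dresses", "0153-0154-0155-", "Fashion/Women's Clothing/Dresses"),
      node("Tops", "0153-0154-0156-", "Fashion/Women's Clothing/Tops")
    )),
    node("Men's Clothing", "0153-0157-", "Fashion/Men's Clothing", list(
      node("Shirts", "0153-0157-0158-", "Fashion/Men's Clothing/Shirts"),
      node("Pants", "0153-0157-0159-", "Fashion/Men's Clothing/Pants")
    ))
  )
))

# Collect the children of every first-level category into one flat list.
second_level <- unlist(
  lapply(cats$Category$Subcategories, function(x) x$Subcategories),
  recursive = FALSE
)

# Turn each node into a one-row data frame and stack the rows.
lowest <- do.call(rbind, lapply(second_level, function(x) {
  data.frame(Name = x$Name, Number = x$Number, Path = x$Path,
             stringsAsFactors = FALSE)
}))
```

This only works when the lowest level sits exactly two levels below the root, which is why it does not generalise to arbitrarily deep nesting.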
Update: If the subcategories are nested indefinitely (theoretically), I guess the easiest approach would be to unlist the whole list and make use of the name structure instead of recursing.
E.g.
Output:
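A rough sketch of the unlist idea, under a few assumptions that are mine rather than the answer's: every subcategory node carries all three fields (otherwise the columns would misalign), a child's Path is always its parent's Path plus "/" and a name, and `cats` is a hand-built, hypothetical stand-in for the parsed JSON. After unlist(), the names look like Category.Subcategories.Name, Category.Subcategories.Subcategories.Name and so on, so a regular expression on the names is enough to pick out each field at any depth.

```r
# Hypothetical helper and hand-built stand-in for the parsed API response.
node <- function(name, number, path, subs = list()) {
  list(Name = name, Number = number, Path = path, Subcategories = subs)
}
cats <- list(Category = list(
  Name = "Fashion",
  Subcategories = list(
    node("Women's Clothing", "0153-0154-", "Fashion/Women's Clothing", list(
      node("Dresses", "0153-0154-0155-", "Fashion/Women's Clothing/Dresses"),
      node("Tops", "0153-0154-0156-", "Fashion/Women's Clothing/Tops")
    )),
    node("Men's Clothing", "0153-0157-", "Fashion/Men's Clothing", list(
      node("Shirts", "0153-0157-0158-", "Fashion/Men's Clothing/Shirts"),
      node("Pants", "0153-0157-0159-", "Fashion/Men's Clothing/Pants")
    ))
  )
))

# Flatten everything into one named character vector (depth-first order).
x <- unlist(cats)

# Pick one field from every subcategory node, whatever its depth.
# The root's "Category.Name" is skipped because it has no "Subcategories." part.
get_field <- function(field) {
  unname(x[grepl(paste0("Subcategories\\.", field, "$"), names(x))])
}

all_nodes <- data.frame(
  Name = get_field("Name"),
  Number = get_field("Number"),
  Path = get_field("Path"),
  stringsAsFactors = FALSE
)

# A node is a parent if some other Path starts with its own Path plus "/".
is_parent <- vapply(
  all_nodes$Path,
  function(p) any(startsWith(all_nodes$Path, paste0(p, "/"))),
  logical(1),
  USE.NAMES = FALSE
)

# Keep only the lowest-level nodes and renumber the rows.
leaves <- all_nodes[!is_parent, ]
rownames(leaves) <- NULL
```

Dropping the `is_parent` filter would give every subcategory at every level instead of only the leaves.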
Data:
With json_string being your JSON text:
Output:
Here is a solution using rrapply() in package rrapply. We prune the Name, Number and Path elements of all nodes containing a Path variable (nb: the root node does not have a Path variable). This avoids the need for regular expressions applied to the unlisted/collapsed list names.
The second rrapply() could also be replaced by e.g. dplyr::bind_rows(), do.call(rbind, ...) or data.table::rbindlist().
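For readers without rrapply installed, the same idea can be sketched in base R. This is a swapped-in technique, not the rrapply code from this answer: a small recursive function (the name `collect_leaves` is made up here) keeps Name, Number and Path from every node that has a Path and no further Subcategories, and do.call(rbind, ...) does the binding step. The `cats` list is again a hypothetical, hand-built stand-in for the parsed JSON.

```r
# Hypothetical helper and hand-built stand-in for the parsed API response.
node <- function(name, number, path, subs = list()) {
  list(Name = name, Number = number, Path = path, Subcategories = subs)
}
cats <- list(Category = list(
  Name = "Fashion",
  Subcategories = list(
    node("Women's Clothing", "0153-0154-", "Fashion/Women's Clothing", list(
      node("Dresses", "0153-0154-0155-", "Fashion/Women's Clothing/Dresses"),
      node("Tops", "0153-0154-0156-", "Fashion/Women's Clothing/Tops")
    )),
    node("Men's Clothing", "0153-0157-", "Fashion/Men's Clothing", list(
      node("Shirts", "0153-0157-0158-", "Fashion/Men's Clothing/Shirts"),
      node("Pants", "0153-0157-0159-", "Fashion/Men's Clothing/Pants")
    ))
  )
))

# Walk the list; return a flat list of pruned leaf nodes.
collect_leaves <- function(x) {
  if (!is.list(x)) return(list())  # atomic values hold no nodes
  # A leaf has a Path (so it is not the root) and no children of its own.
  is_leaf <- "Path" %in% names(x) && length(x$Subcategories) == 0
  if (is_leaf) return(list(x[c("Name", "Number", "Path")]))
  # Otherwise recurse into every element and flatten one level.
  unlist(lapply(x, collect_leaves), recursive = FALSE, use.names = FALSE)
}

rows <- collect_leaves(cats)

# Bind the pruned nodes into a data frame, one row per leaf.
result <- do.call(rbind, lapply(rows, as.data.frame, stringsAsFactors = FALSE))
```

Removing the `length(x$Subcategories) == 0` check and recursing into children even after a match would collect every node with a Path rather than only the lowest level.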