skip to Main Content

Hey stack overflow legends,

I’m trying to extract a dataframe from a nested list within a nested list, and it feels like I’m stuck in some kind of Inception-like dream within a dream. These lists are just nesting inside each other indefinitely, it’s like a never-ending Russian doll situation.

I’m using the following Trademe API call (https://api.trademe.co.nz/v1/Categories/0153-.xml) and the data is structured somewhat like this:

{
  "Category": {
    "Name": "Fashion",
    "Subcategories": [
      {
        "Name": "Women's Clothing",
        "Number": "0153-0154-",
        "Path": "Fashion/Women's Clothing",
        "Subcategories": [
          {
            "Name": "Dresses",
            "Number": "0153-0154-0155-",
            "Path": "Fashion/Women's Clothing/Dresses",
            "Subcategories": []
          },
          {
            "Name": "Tops",
            "Number": "0153-0154-0156-",
            "Path": "Fashion/Women's Clothing/Tops",
            "Subcategories": []
          },
          ...
        ]
      },
      {
        "Name": "Men's Clothing",
        "Number": "0153-0157-",
        "Path": "Fashion/Men's Clothing",
        "Subcategories": [
          {
            "Name": "Shirts",
            "Number": "0153-0157-0158-",
            "Path": "Fashion/Men's Clothing/Shirts",
            "Subcategories": []
          },
          {
            "Name": "Pants",
            "Number": "0153-0157-0159-",
            "Path": "Fashion/Men's Clothing/Pants",
            "Subcategories": []
          },
          ...
        ]
      },
      ...
    ]
  }
}

I want to extract a dataframe with three columns: "Name", "Number", and "Path". The "Name" column should contain the names of all the subcategories, the "Number" column should contain their corresponding numbers, and the "Path" column should contain their full paths. I really only need the data from the lowest level.

I’ve tried using lapply() and sapply(), but I just keep getting lost in the nested lists. Any help would be greatly appreciated!

Thanks in advance,
[Your Name]

3

Answers


  1. In the provided example we could do:

    library(jsonlite)
    library(purrr)
    library(dplyr)
    
    jsontext |>
      fromJSON() |>
      pluck("Category", "Subcategories", "Subcategories") |>
      bind_rows() |>
      select(-Subcategories)
    

    Output:

         Name          Number                            Path
    1 Dresses 0153-0154-0155- Fashion/Womens Clothing/Dresses
    2    Tops 0153-0154-0156-    Fashion/Womens Clothing/Tops
    3  Shirts 0153-0157-0158-    Fashion/Mens Clothing/Shirts
    4   Pants 0153-0157-0159-     Fashion/Mens Clothing/Pants
    

    Update: If the subcategories is nested indefinitely (theoretically) I guess that the easiest would be to unlist it and make use of the name structure instead of doing recursions.

    E.g.

    jsonunlisted <- 
      jsontext |>
      fromJSON() |>
      pluck("Category", "Subcategories") |>
      unlist() 
    
    cbind(Name = jsonunlisted[grep("Name", names(jsonunlisted))],
          Number = jsonunlisted[grep("Number", names(jsonunlisted))],
          Path = jsonunlisted[grep("Path", names(jsonunlisted))])
    

    Output:

                                      Name              Number                 Path                               
    Name1                             "Womens Clothing" "0153-0154-"           "Fashion/Womens Clothing"          
    Name2                             "Mens Clothing"   "0153-0157-"           "Fashion/Mens Clothing"            
    Subcategories.Name1               "Dresses"         "0153-0154-0155-"      "Fashion/Womens Clothing/Dresses"  
    Subcategories.Name2               "Tops"            "0153-0154-0156-"      "Fashion/Womens Clothing/Tops"     
    Subcategories.Name1               "Shirts"          "0153-0157-0158-"      "Fashion/Mens Clothing/Shirts"     
    Subcategories.Name2               "Pants"           "0153-0157-0159-"      "Fashion/Mens Clothing/Pants"      
    Subcategories.Subcategories.Name1 "More Shirts"     "More 0153-0157-0158-" "More Fashion/Mens Clothing/Shirts"
    Subcategories.Subcategories.Name2 "More Pants"      "More 0153-0157-0159-" "More Fashion/Mens Clothing/Pants" 
    

    Data:

    jsontext <- '{
      "Category": {
        "Name": "Fashion",
        "Subcategories": [
          {
            "Name": "Womens Clothing",
            "Number": "0153-0154-",
            "Path": "Fashion/Womens Clothing",
            "Subcategories": [
              {
                "Name": "Dresses",
                "Number": "0153-0154-0155-",
                "Path": "Fashion/Womens Clothing/Dresses",
                "Subcategories": []
              },
              {
                "Name": "Tops",
                "Number": "0153-0154-0156-",
                "Path": "Fashion/Womens Clothing/Tops",
                "Subcategories": []
              }
            ]
          },
          {
            "Name": "Mens Clothing",
            "Number": "0153-0157-",
            "Path": "Fashion/Mens Clothing",
            "Subcategories": [
              {
                "Name": "Shirts",
                "Number": "0153-0157-0158-",
                "Path": "Fashion/Mens Clothing/Shirts",
                "Subcategories": []
              },
              {
                "Name": "Pants",
                "Number": "0153-0157-0159-",
                "Path": "Fashion/Mens Clothing/Pants",
                "Subcategories": [          {
                "Name": "More Shirts",
                "Number": "More 0153-0157-0158-",
                "Path": "More Fashion/Mens Clothing/Shirts",
                "Subcategories": []
              },
              {
                "Name": "More Pants",
                "Number": "More 0153-0157-0159-",
                "Path": "More Fashion/Mens Clothing/Pants",
                "Subcategories": []
              }]
              }
            ]
          }
        ]
      }
    }'
    
    Login or Signup to reply.
  2. json_string being your JSON text:

    library(jsonlite)
    fromJSON(json_string, simplifyDataFrame = TRUE)$Category$Subcategories[, -4]
    

    output:

                 Name     Number                    Path
    1 Womens Clothing 0153-0154- Fashion/Womens Clothing
    2   Mens Clothing 0153-0157-   Fashion/Mens Clothing
    
    Login or Signup to reply.
  3. Here is a solution using rrapply() in package rrapply. We prune the Name, Number and Path elements of all nodes containing a Path variable (nb: the root node does not have a Path variable). This avoids the need for regular expressions applied to the unlisted/collapsed list names.

    library(jsonlite)
    library(rrapply)
    
    jsontext |>
      fromJSON(simplifyVector = FALSE) |>
      rrapply(condition = (x, .xname) .xname == "Path", f = (x, .xsiblings) .xsiblings[c("Name", "Number", "Path")], how = "flatten") |>
      rrapply(how = "bind")
    
    #>              Name               Number                              Path
    #> 1 Womens Clothing           0153-0154-           Fashion/Womens Clothing
    #> 2         Dresses      0153-0154-0155-   Fashion/Womens Clothing/Dresses
    #> 3            Tops      0153-0154-0156-      Fashion/Womens Clothing/Tops
    #> 4   Mens Clothing           0153-0157-             Fashion/Mens Clothing
    #> 5          Shirts      0153-0157-0158-      Fashion/Mens Clothing/Shirts
    #> 6           Pants      0153-0157-0159-       Fashion/Mens Clothing/Pants
    #> 7     More Shirts More 0153-0157-0158- More Fashion/Mens Clothing/Shirts
    #> 8      More Pants More 0153-0157-0159-  More Fashion/Mens Clothing/Pants
    

    The second rrapply() could also be replaced by e.g. dplyr::bind_rows(), do.call(rbind, ...) or data.table::rbindlist().

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search