skip to Main Content

I currently have the following problem: I extracted some data via the crunchbase API, resulting in a big nested list of the following structure (there are many more nested lists on several instances included, I here only display the part of the structure currently relevant for me):

> str(x[[1]])
$ uuid         : chr "5f9957b0841251e6e439d757XXXXXX"
$ relationships: List of 27
..$ websites: List of 3
.. ..$ cardinality: chr "OneToMany"
.. ..$ items      :'data.frame':    4 obs. of  7 variables:
.. .. ..$ properties.website_type: chr [1:4] "homepage" "facebook" "twitter" "linkedin"
.. .. ..$ properties.url         : chr [1:4] "http://www.example.com" "https://www.facebook.com/example" "http://twitter.com/example" "http://www.linkedin.com/company/example"

Consider the following minimal example:

x <- list()
x[[1]] <- list(uuid = "123", 
           relationships = list(websites = list(items =  list(
                                                properties.website_type = c("homepage", "facebook", "twitter", "linkedin"), 
                                                properties.url = c("www.example1.com", "www.fbex1.com", "www.twitterex1.com", "www.linkedinex1.com") ) )  ) )
x[[2]] <- list(uuid = "987", 
           relationships = list(websites = list(items =  list(
             properties.website_type = c("homepage", "facebook", "twitter" ), 
             properties.url = c("www.example2.com", "www.fbex2.com", "www.twitterex2.com") ) )  ) )

Now, I would like to create a dataframe with the following column structure:

> x.df
uuid          web.url  web.facebook        web.twitter        web.linkedin
1  123 www.example1.com www.fbex1.com www.twitterex1.com www.linkedinex1.com
2  987 www.example2.com www.fbex2.com www.twitterex2.com                <NA>

Meaning: I would like to have every uuid (a unique firm identifier) in a single column, followed by the urls of the different platforms (fb, twitter…). I tried a lot of different things with a combination of lapply(), spread(), and row_bind(), yet didn’t manage to make anything work. Any help on that would be appreciated.

2

Answers


  1. dplyr approach could be

    library(dplyr)
    library(tidyr)
    
    #convert list to dataframe in long format
    df <- do.call(rbind, lapply(x, data.frame, stringsAsFactors = FALSE))
    
    #final result
    df1 <- df %>%
      spread(relationships.websites.items.properties.website_type, relationships.websites.items.properties.url)
    

    which gives

      uuid      facebook         homepage            linkedin            twitter
    1  123 www.fbex1.com www.example1.com www.linkedinex1.com www.twitterex1.com
    2  987 www.fbex2.com www.example2.com                <NA> www.twitterex2.com
    

    Sample data:

    x <- list(structure(list(uuid = "123", relationships = structure(list(
        websites = structure(list(items = structure(list(properties.website_type = c("homepage", 
        "facebook", "twitter", "linkedin"), properties.url = c("www.example1.com", 
        "www.fbex1.com", "www.twitterex1.com", "www.linkedinex1.com"
        )), .Names = c("properties.website_type", "properties.url"
        ))), .Names = "items")), .Names = "websites")), .Names = c("uuid", 
    "relationships")), structure(list(uuid = "987", relationships = structure(list(
        websites = structure(list(items = structure(list(properties.website_type = c("homepage", 
        "facebook", "twitter"), properties.url = c("www.example2.com", 
        "www.fbex2.com", "www.twitterex2.com")), .Names = c("properties.website_type", 
        "properties.url"))), .Names = "items")), .Names = "websites")), .Names = c("uuid", 
    "relationships")))
    

    Update: In order to fix below error

    Error in (function (…, row.names = NULL, check.rows = FALSE,
    check.names = TRUE, : arguments imply differing number of rows: 1,
    0

    you would need to remove corrupted elements from input data where website_type has one value but properties.url has NULL. Run this chunk of code as a pre-processing step before executing the main solution:

    idx <- which(sapply(x, function(k) is.null(k$relationships$websites$items$properties.url)))
    x <- x[-idx]
    

    Sample data to test this pre-processing step:

    x <- list(structure(list(uuid = "123", relationships = structure(list(
        websites = structure(list(items = structure(list(properties.website_type = c("homepage", 
        "facebook", "twitter", "linkedin"), properties.url = c("www.example1.com", 
        "www.fbex1.com", "www.twitterex1.com", "www.linkedinex1.com"
        )), .Names = c("properties.website_type", "properties.url"
        ))), .Names = "items")), .Names = "websites")), .Names = c("uuid", 
    "relationships")), structure(list(uuid = "987", relationships = structure(list(
        websites = structure(list(items = structure(list(properties.website_type = "homepage", 
            properties.url = NULL), .Names = c("properties.website_type", 
        "properties.url"))), .Names = "items")), .Names = "websites")), .Names = c("uuid", 
    "relationships")), structure(list(uuid = "345", relationships = structure(list(
        websites = structure(list(items = structure(list(properties.website_type = "homepage", 
            properties.url = NULL), .Names = c("properties.website_type", 
        "properties.url"))), .Names = "items")), .Names = "websites")), .Names = c("uuid", 
    "relationships")))
    
    Login or Signup to reply.
  2. I know this is a clunkier solution, but it helped me seeing the process step by step (running str (x_df) to see each result):

    library(tidyverse)
    
    # Using your example
    x <- list()
    x[[1]] <- list(uuid = "123",
                        relationships = list(websites = list(items =  list(
                            properties.website_type = c("homepage", "facebook", "twitter", "linkedin"),
                            properties.url = c("www.example1.com", "www.fbex1.com", "www.twitterex1.com", "www.linkedinex1.com") ) )  ) )
    x[[2]] <- list(uuid = "987",
                        relationships = list(websites = list(items =  list(
                            properties.website_type = c("homepage", "facebook", "twitter" ),
                            properties.url = c("www.example2.com", "www.fbex2.com", "www.twitterex2.com") ) )  ) )
    
     
    
    # --- Iterations of unnest:
    x_df <- x %>% tibble::as_tibble_col( .) %>%  
        tidyr::unnest_wider(col = "value")  %>% 
        tidyr::unnest_longer(col = "relationships")   %>%  
        tidyr::unnest_wider(col = "relationships")  %>%  
        tidyr::unnest_wider(col =  "items")  %>%  
        tidyr::unnest_longer(col = c("properties.website_type", "properties.url")) %>% 
    # --- Lastly, group by id: 
        group_by(uuid) %>% 
        tidyr::pivot_wider(data = ., 
                                 names_from = properties.website_type, 
                                 values_from = c("properties.url"))
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search