skip to Main Content

I have a large data set with several fields along with its value separated by space.
Then these fields are combined to make a single record and each record can have children of variable length Indented with a tab.

content of the file looks something like this :

company Samsung
type private
based South Korea

    company Harman International
    type private
    based United States
    industry Electronics

        company JBL
        type subsidiary
        based United States
        industry Audio

company Amazaon
type public
based United States
industry Cloud computing, e-commerce, artificial intelligence, consumer electronics

I want to store these records while maintaining the hierarchical structure and with an option to do quick search and way to access every record.

So far I came up with this approach :

# reading file from the source
path <- "/path/to/file.txt"
content <- readLines(path, warn = F)


# replaces , with ; so it does not translate it as a separator in next step
content <- gsub(",", ";", content)

# creating list of fields and value
contentList <- read.csv(text=sub(" ", ",", content), header=FALSE)

# replacing ; with , to revert data in right format
contentList$V2 <- gsub(";", ",", contentList$V2)

After above step contentList look like this :

contentList output

In the next step, I thought of using a function that would create a list with these rules:

  1. if the field does not have any t add it to the list(as named vector)
  2. if the field have one or more t make it a sub-list(as named vector) of previous record

But don’t know how this could be implemented in R.

How should I implement this?

Or Is there a better way to solve this problem that performs searching and accessing values quickly?

2

Answers


  1. RAW DATA IN

    raw <- read_lines("company Samsung
    type private
    based South Korea
    
        company Harman International
        type private
        based United States
        industry Electronics
    
            company JBL
            type subsidiary
            based United States
            industry Audio
    
    company Amazaon
    type public
    based United States
    industry Cloud computing, e-commerce, artificial intelligence, consumer electronics")
    

    PUT IN A TIBBLE AND GET THE INDENTUREMENT

    library(tidyverse)
    rawDf <- tibble(RAW = raw)
    
    companyIndenture <- rawDf %>% 
        filter(str_detect(RAW, "^\s*company")) %>% 
        mutate(LVL = case_when(
          str_detect(RAW, "^\s{8}") ~ 3,
          str_detect(RAW, "^\s{4}") ~ 2,
          TRUE ~ 1),
          COMPANY = str_replace(RAW, "^\s*company\s", "")) %>% 
        select(-RAW)
    # Gives us    
    # A tibble: 4 x 2
    # LVL COMPANY             
    # <dbl> <chr>               
    # 1     1 Samsung             
    # 2     2 Harman International
    # 3     3 JBL                 
    # 4     1 Amazaon             
    

    CLEAN WHITESPACE

    Now that we know what LVL each company is, lets get rid of some whitespace

    nextly <- rawDf %>% 
      mutate(RAW = str_replace(RAW, "^\s*", "")) %>% 
      filter(RAW != "") %>% 
      separate(RAW, c("ATTR", "VALUE"), sep = " ", extra = "merge") %>% 
        # And bring the LVL back in
      left_join(companyIndenture, by = c("VALUE" = "COMPANY")) %>% 
      select(LVL, ATTR, VALUE)
    
    # A tibble: 15 x 3
    # LVL ATTR     VALUE                                                                     
    # <dbl> <chr>    <chr>                                                                     
    # 1     1 company  Samsung                                                                   
    # 2    NA type     private                                                                   
    # 3    NA based    South Korea                                                               
    # 4     2 company  Harman International                                                      
    # 5    NA type     private                                                                   
    # 6    NA based    United States                                                             
    # 7    NA industry Electronics                                                               
    # 8     3 company  JBL                                                                       
    # 9    NA type     subsidiary                                                                
    # 10    NA based    United States                                                             
    # 11    NA industry Audio                                                                     
    # 12     1 company  Amazaon                                                                   
    # 13    NA type     public                                                                    
    # 14    NA based    United States                                                             
    # 15    NA industry Cloud computing, e-commerce, artificial intelligence, consumer electronics
    

    DISTRIBUTE THE HIERARCHY

    Each company gets a LVL.1, LVL.2, LVL.3 structure. The “” make it work out right when we fill``.

    further <- nextly %>% 
      mutate(LVL.1 = ifelse(LVL == 1, VALUE, NA_character_),
             LVL.2 = case_when(LVL == 1 ~ "",
                               LVL == 2 ~ VALUE,
                               TRUE ~ NA_character_),
             LVL.3 = ifelse(LVL == 3, VALUE, "")) %>% 
      fill(starts_with("LVL.")) %>% 
      filter(ATTR != "company") %>% 
      select(LVL.1, LVL.2, LVL.3, ATTR, VALUE)
    
    
    # A tibble: 11 x 5
    # LVL.1   LVL.2                LVL.3 ATTR     VALUE                                                                     
    # <chr>   <chr>                <chr> <chr>    <chr>                                                                     
    # 1 Samsung ""                   ""    type     private                                                                   
    # 2 Samsung ""                   ""    based    South Korea                                                               
    # 3 Samsung Harman International ""    type     private                                                                   
    # 4 Samsung Harman International ""    based    United States                                                             
    # 5 Samsung Harman International ""    industry Electronics                                                               
    # 6 Samsung Harman International JBL   type     subsidiary                                                                
    # 7 Samsung Harman International JBL   based    United States                                                             
    # 8 Samsung Harman International JBL   industry Audio                                                                     
    # 9 Amazaon ""                   ""    type     public                                                                    
    # 10 Amazaon ""                   ""    based    United States                                                             
    # 11 Amazaon ""                   ""    industry Cloud computing, e-commerce, artificial intelligence, consumer electronics
    

    HANDLE AMAZAON’S MULTIPLE INDUSTRIES

    Finally, lets str_split and unnes those ‘industry’ values for Amazaon.

    finally <- further %>% 
      mutate(VALUE = str_split(VALUE, ",\s*")) %>% 
      unnest()
    
    
    # A tibble: 14 x 5
    # LVL.1   LVL.2                LVL.3 ATTR     VALUE                  
    # <chr>   <chr>                <chr> <chr>    <chr>                  
    # 1 Samsung ""                   ""    type     private                
    # 2 Samsung ""                   ""    based    South Korea            
    # 3 Samsung Harman International ""    type     private                
    # 4 Samsung Harman International ""    based    United States          
    # 5 Samsung Harman International ""    industry Electronics            
    # 6 Samsung Harman International JBL   type     subsidiary             
    # 7 Samsung Harman International JBL   based    United States          
    # 8 Samsung Harman International JBL   industry Audio                  
    # 9 Amazaon ""                   ""    type     public                 
    # 10 Amazaon ""                   ""    based    United States          
    # 11 Amazaon ""                   ""    industry Cloud computing        
    # 12 Amazaon ""                   ""    industry e-commerce             
    # 13 Amazaon ""                   ""    industry artificial intelligence
    # 14 Amazaon ""                   ""    industry consumer electronics   
    

    Q.E.D.

    LAGNAPPE

    further %>% 
      spread(key = "ATTR", value = "VALUE") %>% 
      mutate(industry = str_split(industry, ",\s*")) %>% 
      unnest()
    # A tibble: 7 x 6
      LVL.1   LVL.2                LVL.3 based         type       industry               
      <chr>   <chr>                <chr> <chr>         <chr>      <chr>                  
    1 Amazaon ""                   ""    United States public     Cloud computing        
    2 Amazaon ""                   ""    United States public     e-commerce             
    3 Amazaon ""                   ""    United States public     artificial intelligence
    4 Amazaon ""                   ""    United States public     consumer electronics   
    5 Samsung ""                   ""    South Korea   private    NA                     
    6 Samsung Harman International ""    United States private    Electronics            
    7 Samsung Harman International JBL   United States subsidiary Audio      
    
    Login or Signup to reply.
  2. Using content from the Note at the end, count the spaces at the beginning of each company line and use gsubfn to replace them with a level number giving L2. Then after trimming away leading spaces replace the first space on each line with a colon giving L3. The file is now in dcf format so read it in using read.dcf giving L4.

    Now generate a lv variable giving the level number as a number and generate sequential numeric ids for each row. Compute the parent id giving parent and then construct a data frame with what we have computed so far. The overall root of the tree will be denoted by 0. From DF generate an edgelist, e, for the graph and convert that to an igraph. From that generate the simple paths and create a data frame DF2 having columns paths, company, type, based and industry such that each row represents one node other than the root.

    If you wish you can add lv and parent to the data frame which we computed but did not add since you may not need those.

    The assumption below is that each indent is 4 spaces.

    There is no restriction on how deep the levels can go.

    We can search DF2 using data frame operations for various text based queries such as

    subset(DF2, grepl("Samsung", paths))  # Samsung and its descendents
    

    or we can use igraph functions for graph queries on g such as

    max(length(get.diameter(g))) - 1   # max depth not counting root
    

    or we can use data.tree functions for queries

    dt$height -  1  # max depth not counting root
    

    Code

    The code follows.

    library(gsubfn)
    
    content <- readLines(textConnection(Lines))
    L2 <- gsubfn("( *)company", ~ paste0("level ", nchar(x) / 4L + 1L, "ncompany"), content)
    L3 <- sub(" ", ":", trimws(readLines(textConnection(L2))))
    L4 <- read.dcf(textConnection(L3))
    lv <- as.numeric(L4[, 1])
    id <- seq_along(lv)
    company <- L4[, "company"]
    parent <- sapply(id, function(i) c(tail(which(lv[1:i] < lv[i]), 1), 0)[1])  
    
    DF <- data.frame(id = company[id], parent = c("0", company)[parent+1], 
      level = lv, L4[, -1], stringsAsFactors = FALSE)
    e <- with(DF, cbind(parent, id))
    

    igraph

    Now that we have an edge list we can create an igraph and process it using that package.

    library(igraph)
    
    g <- graph_from_edgelist(e)
    
    p <- all_simple_paths(g, "0")
    paths <- sapply(p, function(x) paste(names(x), collapse = "/"))
    DF2 <- data.frame(paths, L4[, -1], stringsAsFactors = FALSE)
    DF2
    

    giving a paths column followed by the attributes of each node:

                                   paths              company       type         based                                                                   industry
    1                          0/Samsung              Samsung    private   South Korea                                                                       <NA>
    2     0/Samsung/Harman International Harman International    private United States                                                                Electronics
    3 0/Samsung/Harman International/JBL                  JBL subsidiary United States                                                                      Audio
    4                          0/Amazaon              Amazaon     public United States Cloud computing, e-commerce, artificial intelligence, consumer electronics
    

    We can plot the graph like this:

    plot(g, layout = layout_as_tree(g))
    

    (continued from graph)
    screenshot

    data.tree

    We could also use data.tree and its many functions to process this:

    library(data.tree)
    library(DiagrammeR)
    
    dt <- FromDataFrameNetwork(DF)
    print(dt, "type", "based", "industry")
    

    giving:

                         levelName       type         based                                                                   industry
    1 0                                                                                                                               
    2  ¦--Samsung                     private   South Korea                                                                           
    3  ¦   °--Harman International    private United States                                                                Electronics
    4  ¦       °--JBL              subsidiary United States                                                                      Audio
    5  °--Amazaon                      public United States Cloud computing, e-commerce, artificial intelligence, consumer electronics
    

    We can plot or convert the data tree data as follows

    plot(dt)  # plot in browser
    ToListSimple(dt) # convert to nested list
    ToListExplicit(dt) # similar but children in children component
    

    Note

    We can create content reproducibly like this:

    Lines <- "
    company Samsung
    type private
    based South Korea
    
        company Harman International
        type private
        based United States
        industry Electronics
    
            company JBL
            type subsidiary
            based United States
            industry Audio
    
    company Amazaon
    type public
    based United States
    industry Cloud computing, e-commerce, artificial intelligence, consumer electronics"
    
    content <- readLines(textConnection(Lines))
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search