R : Parsing and storing hierarchical structured data separated by space along with quick search and fast access - Artificial Intelligence

SanchitSaini
May 20, 2020
237 views
1 vote
2 Answers

I have a large data set with several fields along with its value separated by space.
Then these fields are combined to make a single record and each record can have children of variable length Indented with a tab.

content of the file looks something like this :

company Samsung
type private
based South Korea

    company Harman International
    type private
    based United States
    industry Electronics

        company JBL
        type subsidiary
        based United States
        industry Audio

company Amazaon
type public
based United States
industry Cloud computing, e-commerce, artificial intelligence, consumer electronics

I want to store these records while maintaining the hierarchical structure and with an option to do quick search and way to access every record.

So far I came up with this approach :

# reading file from the source
path <- "/path/to/file.txt"
content <- readLines(path, warn = F)


# replaces , with ; so it does not translate it as a separator in next step
content <- gsub(",", ";", content)

# creating list of fields and value
contentList <- read.csv(text=sub(" ", ",", content), header=FALSE)

# replacing ; with , to revert data in right format
contentList$V2 <- gsub(";", ",", contentList$V2)

After above step contentList look like this :

In the next step, I thought of using a function that would create a list with these rules:

if the field does not have any t add it to the list(as named vector)
if the field have one or more t make it a sub-list(as named vector) of previous record

But don’t know how this could be implemented in R.

How should I implement this?

Or Is there a better way to solve this problem that performs searching and accessing values quickly?

Tags: nested-lists r

Answers

RAW DATA IN

raw <- read_lines("company Samsung
type private
based South Korea

    company Harman International
    type private
    based United States
    industry Electronics

        company JBL
        type subsidiary
        based United States
        industry Audio

company Amazaon
type public
based United States
industry Cloud computing, e-commerce, artificial intelligence, consumer electronics")

PUT IN A TIBBLE AND GET THE INDENTUREMENT

library(tidyverse)
rawDf <- tibble(RAW = raw)

companyIndenture <- rawDf %>% 
    filter(str_detect(RAW, "^\s*company")) %>% 
    mutate(LVL = case_when(
      str_detect(RAW, "^\s{8}") ~ 3,
      str_detect(RAW, "^\s{4}") ~ 2,
      TRUE ~ 1),
      COMPANY = str_replace(RAW, "^\s*company\s", "")) %>% 
    select(-RAW)
# Gives us    
# A tibble: 4 x 2
# LVL COMPANY             
# <dbl> <chr>               
# 1     1 Samsung             
# 2     2 Harman International
# 3     3 JBL                 
# 4     1 Amazaon

CLEAN WHITESPACE

Now that we know what LVL each company is, lets get rid of some whitespace

nextly <- rawDf %>% 
  mutate(RAW = str_replace(RAW, "^\s*", "")) %>% 
  filter(RAW != "") %>% 
  separate(RAW, c("ATTR", "VALUE"), sep = " ", extra = "merge") %>% 
    # And bring the LVL back in
  left_join(companyIndenture, by = c("VALUE" = "COMPANY")) %>% 
  select(LVL, ATTR, VALUE)

# A tibble: 15 x 3
# LVL ATTR     VALUE                                                                     
# <dbl> <chr>    <chr>                                                                     
# 1     1 company  Samsung                                                                   
# 2    NA type     private                                                                   
# 3    NA based    South Korea                                                               
# 4     2 company  Harman International                                                      
# 5    NA type     private                                                                   
# 6    NA based    United States                                                             
# 7    NA industry Electronics                                                               
# 8     3 company  JBL                                                                       
# 9    NA type     subsidiary                                                                
# 10    NA based    United States                                                             
# 11    NA industry Audio                                                                     
# 12     1 company  Amazaon                                                                   
# 13    NA type     public                                                                    
# 14    NA based    United States                                                             
# 15    NA industry Cloud computing, e-commerce, artificial intelligence, consumer electronics

DISTRIBUTE THE HIERARCHY

Each company gets a LVL.1, LVL.2, LVL.3 structure. The “” make it work out right when we fill``.

further <- nextly %>% 
  mutate(LVL.1 = ifelse(LVL == 1, VALUE, NA_character_),
         LVL.2 = case_when(LVL == 1 ~ "",
                           LVL == 2 ~ VALUE,
                           TRUE ~ NA_character_),
         LVL.3 = ifelse(LVL == 3, VALUE, "")) %>% 
  fill(starts_with("LVL.")) %>% 
  filter(ATTR != "company") %>% 
  select(LVL.1, LVL.2, LVL.3, ATTR, VALUE)


# A tibble: 11 x 5
# LVL.1   LVL.2                LVL.3 ATTR     VALUE                                                                     
# <chr>   <chr>                <chr> <chr>    <chr>                                                                     
# 1 Samsung ""                   ""    type     private                                                                   
# 2 Samsung ""                   ""    based    South Korea                                                               
# 3 Samsung Harman International ""    type     private                                                                   
# 4 Samsung Harman International ""    based    United States                                                             
# 5 Samsung Harman International ""    industry Electronics                                                               
# 6 Samsung Harman International JBL   type     subsidiary                                                                
# 7 Samsung Harman International JBL   based    United States                                                             
# 8 Samsung Harman International JBL   industry Audio                                                                     
# 9 Amazaon ""                   ""    type     public                                                                    
# 10 Amazaon ""                   ""    based    United States                                                             
# 11 Amazaon ""                   ""    industry Cloud computing, e-commerce, artificial intelligence, consumer electronics

HANDLE AMAZAON’S MULTIPLE INDUSTRIES

Finally, lets str_split and unnes those ‘industry’ values for Amazaon.

finally <- further %>% 
  mutate(VALUE = str_split(VALUE, ",\s*")) %>% 
  unnest()


# A tibble: 14 x 5
# LVL.1   LVL.2                LVL.3 ATTR     VALUE                  
# <chr>   <chr>                <chr> <chr>    <chr>                  
# 1 Samsung ""                   ""    type     private                
# 2 Samsung ""                   ""    based    South Korea            
# 3 Samsung Harman International ""    type     private                
# 4 Samsung Harman International ""    based    United States          
# 5 Samsung Harman International ""    industry Electronics            
# 6 Samsung Harman International JBL   type     subsidiary             
# 7 Samsung Harman International JBL   based    United States          
# 8 Samsung Harman International JBL   industry Audio                  
# 9 Amazaon ""                   ""    type     public                 
# 10 Amazaon ""                   ""    based    United States          
# 11 Amazaon ""                   ""    industry Cloud computing        
# 12 Amazaon ""                   ""    industry e-commerce             
# 13 Amazaon ""                   ""    industry artificial intelligence
# 14 Amazaon ""                   ""    industry consumer electronics

Q.E.D.

LAGNAPPE

further %>% 
  spread(key = "ATTR", value = "VALUE") %>% 
  mutate(industry = str_split(industry, ",\s*")) %>% 
  unnest()
# A tibble: 7 x 6
  LVL.1   LVL.2                LVL.3 based         type       industry               
  <chr>   <chr>                <chr> <chr>         <chr>      <chr>                  
1 Amazaon ""                   ""    United States public     Cloud computing        
2 Amazaon ""                   ""    United States public     e-commerce             
3 Amazaon ""                   ""    United States public     artificial intelligence
4 Amazaon ""                   ""    United States public     consumer electronics   
5 Samsung ""                   ""    South Korea   private    NA                     
6 Samsung Harman International ""    United States private    Electronics            
7 Samsung Harman International JBL   United States subsidiary Audio

Using content from the Note at the end, count the spaces at the beginning of each company line and use gsubfn to replace them with a level number giving L2. Then after trimming away leading spaces replace the first space on each line with a colon giving L3. The file is now in dcf format so read it in using read.dcf giving L4.

Now generate a lv variable giving the level number as a number and generate sequential numeric ids for each row. Compute the parent id giving parent and then construct a data frame with what we have computed so far. The overall root of the tree will be denoted by 0. From DF generate an edgelist, e, for the graph and convert that to an igraph. From that generate the simple paths and create a data frame DF2 having columns paths, company, type, based and industry such that each row represents one node other than the root.

If you wish you can add lv and parent to the data frame which we computed but did not add since you may not need those.

The assumption below is that each indent is 4 spaces.

There is no restriction on how deep the levels can go.

We can search DF2 using data frame operations for various text based queries such as

subset(DF2, grepl("Samsung", paths))  # Samsung and its descendents

or we can use igraph functions for graph queries on g such as

max(length(get.diameter(g))) - 1   # max depth not counting root

or we can use data.tree functions for queries

dt$height -  1  # max depth not counting root

Code

The code follows.

library(gsubfn)

content <- readLines(textConnection(Lines))
L2 <- gsubfn("( *)company", ~ paste0("level ", nchar(x) / 4L + 1L, "ncompany"), content)
L3 <- sub(" ", ":", trimws(readLines(textConnection(L2))))
L4 <- read.dcf(textConnection(L3))
lv <- as.numeric(L4[, 1])
id <- seq_along(lv)
company <- L4[, "company"]
parent <- sapply(id, function(i) c(tail(which(lv[1:i] < lv[i]), 1), 0)[1])  

DF <- data.frame(id = company[id], parent = c("0", company)[parent+1], 
  level = lv, L4[, -1], stringsAsFactors = FALSE)
e <- with(DF, cbind(parent, id))

igraph

Now that we have an edge list we can create an igraph and process it using that package.

library(igraph)

g <- graph_from_edgelist(e)

p <- all_simple_paths(g, "0")
paths <- sapply(p, function(x) paste(names(x), collapse = "/"))
DF2 <- data.frame(paths, L4[, -1], stringsAsFactors = FALSE)
DF2

giving a paths column followed by the attributes of each node:

                               paths              company       type         based                                                                   industry
1                          0/Samsung              Samsung    private   South Korea                                                                       <NA>
2     0/Samsung/Harman International Harman International    private United States                                                                Electronics
3 0/Samsung/Harman International/JBL                  JBL subsidiary United States                                                                      Audio
4                          0/Amazaon              Amazaon     public United States Cloud computing, e-commerce, artificial intelligence, consumer electronics

We can plot the graph like this:

plot(g, layout = layout_as_tree(g))

(continued from graph)

data.tree

We could also use data.tree and its many functions to process this:

library(data.tree)
library(DiagrammeR)

dt <- FromDataFrameNetwork(DF)
print(dt, "type", "based", "industry")

giving:

                     levelName       type         based                                                                   industry
1 0                                                                                                                               
2  ¦--Samsung                     private   South Korea                                                                           
3  ¦   °--Harman International    private United States                                                                Electronics
4  ¦       °--JBL              subsidiary United States                                                                      Audio
5  °--Amazaon                      public United States Cloud computing, e-commerce, artificial intelligence, consumer electronics

We can plot or convert the data tree data as follows

plot(dt)  # plot in browser
ToListSimple(dt) # convert to nested list
ToListExplicit(dt) # similar but children in children component

Note

We can create content reproducibly like this:

Lines <- "
company Samsung
type private
based South Korea

    company Harman International
    type private
    based United States
    industry Electronics

        company JBL
        type subsidiary
        based United States
        industry Audio

company Amazaon
type public
based United States
industry Cloud computing, e-commerce, artificial intelligence, consumer electronics"

content <- readLines(textConnection(Lines))

Please signup or login to give your own answer.

Click here to cancel reply.

R : Parsing and storing hierarchical structured data separated by space along with quick search and fast access – Artificial Intelligence