I have a large data set in which each line holds a field name and its value, separated by a space.
Consecutive fields form a single record, and each record can have a variable number of child records, indented with a tab.
The content of the file looks something like this:
company Samsung
type private
based South Korea
company Harman International
type private
based United States
industry Electronics
company JBL
type subsidiary
based United States
industry Audio
company Amazaon
type public
based United States
industry Cloud computing, e-commerce, artificial intelligence, consumer electronics
I want to store these records while maintaining the hierarchical structure, with the option of a quick search and a way to access every record.
So far I have come up with this approach:
# read the file from the source
path <- "/path/to/file.txt"
content <- readLines(path, warn = FALSE)
# replace "," with ";" so commas inside values are not treated as separators in the next step
content <- gsub(",", ";", content)
# split each line into field and value at the first space
contentList <- read.csv(text = sub(" ", ",", content), header = FALSE)
# replace ";" with "," to restore the original values
contentList$V2 <- gsub(";", ",", contentList$V2)
After the above step, contentList looks like this:
In the next step, I thought of using a function that would create a list with these rules:
- if the field does not have any \t, add it to the list (as a named vector)
- if the field has one or more \t, make it a sub-list (as a named vector) of the previous record
But I don't know how this could be implemented in R.
How should I implement this?
Or is there a better way to solve this problem that allows searching and accessing values quickly?
2 Answers
RAW DATA IN
PUT IN A TIBBLE AND GET THE INDENTUREMENT
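A minimal sketch of this step, assuming tab-indented input where each leading tab is one level. The sample lines, their nesting, and the column names raw and LVL are illustrative assumptions, not the answer's original code:

```r
library(tibble)

# Assumed sample input: one leading tab per level of nesting.
content <- c(
  "company Samsung",
  "type private",
  "\tcompany Harman International",
  "\ttype private",
  "\t\tcompany JBL",
  "\t\ttype subsidiary",
  "company Amazaon",
  "type public"
)

# Length of the run of leading tabs on each line (0 when there is none).
n_tabs <- attr(regexpr("^\t*", content), "match.length")

# Top-level records are level 1, their children level 2, and so on.
df <- tibble(raw = content, LVL = n_tabs + 1L)
```

Every field line carries the level of the record it belongs to, so the level can later be read off the company rows alone.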
CLEAN WHITESPACE
Now that we know what LVL each company is, let's get rid of some whitespace.
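The cleanup could look roughly like this. It is a base R stand-in (the answer most likely used tidyverse helpers), and the names raw, field and value are assumptions made for illustration:

```r
# Assumed sample: lines that still carry their leading tabs.
raw <- c("\tcompany Harman International", "\t\ttype subsidiary")

# Strip leading and trailing whitespace, including tabs.
clean <- trimws(raw)

# Everything before the first space is the field name, the rest is the value.
parsed <- data.frame(
  field = sub(" .*$", "", clean),
  value = sub("^[^ ]+ ", "", clean),
  stringsAsFactors = FALSE
)
```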
DISTRIBUTE THE HIERARCHY
Each company gets a LVL.1, LVL.2, LVL.3 structure. The "" values make it work out right when we fill.
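One way to read this step, sketched with tidyr::fill. The data frame companies and its nesting are assumed for illustration. A company writes its own name at its level, "" at every deeper level, and NA at every shallower level, so that filling downward copies only the ancestors:

```r
library(tidyr)

# Assumed input: one row per company with the level found earlier.
companies <- data.frame(
  company = c("Samsung", "Harman International", "JBL", "Amazaon"),
  LVL = c(1L, 2L, 3L, 1L),
  stringsAsFactors = FALSE
)

for (k in 1:3) {
  companies[[paste0("LVL.", k)]] <- ifelse(
    companies$LVL == k, companies$company,      # own level: the company name
    ifelse(companies$LVL < k, "", NA_character_) # deeper: "", shallower: NA
  )
}

# NA cells inherit the value above them; "" cells block inheritance.
companies <- fill(companies, LVL.1, LVL.2, LVL.3)
```

Without the "" placeholders, Amazaon would wrongly inherit Harman International and JBL from the rows above it.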
HANDLE AMAZAON’S MULTIPLE INDUSTRIES
Finally, let's str_split and unnest those 'industry' values for Amazaon.
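A short sketch of that split, assuming a two-row table built here only for illustration:

```r
library(tibble)
library(stringr)
library(tidyr)

# Assumed sample: one row per company, industries still comma-separated.
df <- tibble(
  company = c("JBL", "Amazaon"),
  industry = c(
    "Audio",
    "Cloud computing, e-commerce, artificial intelligence, consumer electronics"
  )
)

# Split on commas (and any following spaces) into a list column ...
df$industry <- str_split(df$industry, ",\\s*")

# ... then give every industry its own row.
long <- unnest(df, industry)
```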
Q.E.D.
LAGNAPPE
Using content from the Note at the end, count the spaces at the beginning of each company line and use gsubfn to replace them with a level number giving L2. Then after trimming away leading spaces replace the first space on each line with a colon giving L3. The file is now in dcf format so read it in using read.dcf giving L4.
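The steps above can be sketched as follows. This version uses base R in place of gsubfn, writes the level as its own "level" field, and assumes the nesting shown in content (4 spaces per indent), so treat it as an approximation of the answer's code rather than a copy of it:

```r
content <- c(
  "company Samsung", "type private", "based South Korea",
  "    company Harman International", "    type private",
  "    based United States", "    industry Electronics",
  "        company JBL", "        type subsidiary",
  "        based United States", "        industry Audio",
  "company Amazaon", "type public", "based United States",
  "industry Cloud computing, e-commerce, artificial intelligence, consumer electronics"
)

indent <- attr(regexpr("^ *", content), "match.length")
is_company <- grepl("^ *company ", content)

# L2: start each record with a blank separator line and a level line.
L2 <- unlist(lapply(seq_along(content), function(i) {
  line <- trimws(content[i])
  if (is_company[i]) c("", paste("level", indent[i] / 4 + 1), line) else line
}))
L2 <- L2[-1]  # no separator needed before the first record

# L3: the first space on each line becomes a colon (plus a space), as in DCF.
L3 <- sub(" ", ": ", L2)

# L4: read.dcf returns a character matrix with one row per record.
L4 <- as.data.frame(read.dcf(textConnection(L3)), stringsAsFactors = FALSE)
```

Fields that a record lacks, such as industry for Samsung, come back as NA.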
Now generate a lv variable giving the level number as a number and generate sequential numeric ids for each row. Compute the parent id giving parent and then construct a data frame with what we have computed so far. The overall root of the tree will be denoted by 0. From DF generate an edgelist, e, for the graph and convert that to an igraph. From that generate the simple paths and create a data frame DF2 having columns paths, company, type, based and industry such that each row represents one node other than the root.
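The parent computation might look like this. The small L4 below is an assumed stand-in for the result of the previous step, and the rule used is that a node's parent is the nearest earlier row exactly one level up:

```r
# Assumed stand-in for L4, reduced to the columns needed here.
L4 <- data.frame(
  level = c("1", "2", "3", "1"),
  company = c("Samsung", "Harman International", "JBL", "Amazaon"),
  stringsAsFactors = FALSE
)

lv <- as.numeric(L4$level)
id <- seq_along(lv)

# Level-1 rows hang off the overall root, denoted by 0.
parent <- vapply(id, function(i) {
  if (lv[i] == 1) return(0L)
  max(which(lv[seq_len(i - 1)] == lv[i] - 1))
}, integer(1))

DF <- data.frame(id, parent, company = L4$company, stringsAsFactors = FALSE)

# Edge list: one parent -> child pair per row.
e <- DF[, c("parent", "id")]
```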
If you wish, you can also add lv and parent to the data frame; we computed them but left them out since you may not need them.
The assumption below is that each indent is 4 spaces.
There is no restriction on how deep the levels can go.
We can search DF2 using data frame operations for various text based queries such as
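For example, with a small DF2 written out by hand (the "0/1/2" path format is an assumption about what the paths column holds):

```r
DF2 <- data.frame(
  paths = c("0/1", "0/1/2", "0/1/2/3", "0/4"),
  company = c("Samsung", "Harman International", "JBL", "Amazaon"),
  type = c("private", "private", "subsidiary", "public"),
  based = c("South Korea", "United States", "United States", "United States"),
  industry = c(NA, "Electronics", "Audio",
               "Cloud computing, e-commerce, artificial intelligence, consumer electronics"),
  stringsAsFactors = FALSE
)

# Exact match on a field.
us <- subset(DF2, based == "United States")

# Substring match; grepl() returns FALSE for NA, so Samsung is skipped.
audio <- DF2[grepl("Audio", DF2$industry), ]

# All descendants of node 1 (Samsung), using the path prefix.
under_samsung <- DF2[startsWith(DF2$paths, "0/1/"), ]
```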
or we can use igraph functions for graph queries on g such as
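For instance, assuming g is built from a parent/child edge list with vertex 0 as the root (the edge list below is made up for illustration):

```r
library(igraph)

# Assumed edge list: 0 is the root, 1 = Samsung, 2 = Harman International,
# 3 = JBL, 4 = Amazaon.
e <- data.frame(
  from = c("0", "1", "2", "0"),
  to = c("1", "2", "3", "4"),
  stringsAsFactors = FALSE
)
g <- graph_from_data_frame(e)

# Vertex 1 together with everything reachable below it.
desc <- subcomponent(g, "1", mode = "out")

# Direct children of vertex 1 only.
kids <- neighbors(g, "1", mode = "out")
```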
or we can use data.tree functions for queries
Code
The code follows.
igraph
Now that we have an edge list we can create an igraph and process it using that package.
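A sketch of this step, assuming a DF with id, parent and company columns as described earlier. The values are illustrative, and matching each path back to its row by the last path component is a choice made here, not necessarily the answer's:

```r
library(igraph)

# Assumed DF from the earlier step.
DF <- data.frame(
  id = 1:4,
  parent = c(0L, 1L, 2L, 0L),
  company = c("Samsung", "Harman International", "JBL", "Amazaon"),
  stringsAsFactors = FALSE
)

# Edge list -> directed graph; vertex names are the ids as character strings.
e <- DF[, c("parent", "id")]
g <- graph_from_data_frame(e)

# Every simple path starting at the root, written as "0/1/2".
paths <- sapply(all_simple_paths(g, "0"),
                function(p) paste(as_ids(p), collapse = "/"))

# Align each path with its row via the id at the end of the path.
DF2 <- data.frame(
  paths = paths[match(DF$id, sub(".*/", "", paths))],
  company = DF$company,
  stringsAsFactors = FALSE
)
```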
giving a paths column followed by the attributes of each node:
We can plot the graph like this:
data.tree
We could also use data.tree and its many functions to process this:
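A minimal sketch, assuming the paths are spelled out with company names in a pathString column, which as.Node reads by default. The top node name "root" and the sample rows are assumptions:

```r
library(data.tree)

# Assumed input: one row per company, with its full path from the root.
DF2 <- data.frame(
  pathString = c(
    "root/Samsung",
    "root/Samsung/Harman International",
    "root/Samsung/Harman International/JBL",
    "root/Amazaon"
  ),
  type = c("private", "private", "subsidiary", "public"),
  stringsAsFactors = FALSE
)

tree <- as.Node(DF2)

# Show the hierarchy with the type attribute next to each node.
print(tree, "type")

# Look a node up by name and read its attribute.
jbl <- FindNode(tree, "JBL")
jbl_type <- jbl$type
```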
giving:
We can plot or convert the data tree data as follows
Note
We can create content reproducibly like this:
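A sketch of such a definition, using the question's data with 4 spaces per indent. The nesting (Samsung > Harman International > JBL, with Amazaon at top level) is an assumption, since the sample in the question does not show its indentation:

```r
content <- c(
  "company Samsung",
  "type private",
  "based South Korea",
  "    company Harman International",
  "    type private",
  "    based United States",
  "    industry Electronics",
  "        company JBL",
  "        type subsidiary",
  "        based United States",
  "        industry Audio",
  "company Amazaon",
  "type public",
  "based United States",
  "industry Cloud computing, e-commerce, artificial intelligence, consumer electronics"
)

# Optional: write it out and read it back, to mimic reading from a file.
path <- tempfile(fileext = ".txt")
writeLines(content, path)
round_trip <- readLines(path)
```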