First, I have a large dataframe of 180,000 abstracts which I have previously checked for the presence of a list of more than 300 keywords. For each keyword I added a column to each abstract record containing TRUE or FALSE depending on whether the keyword appears in the abstract. So the dataframe now consists of the abstract text plus 300+ columns, one per keyword, holding T/F values.
DF_Abstracts:
Title, Abstract, keyword columns 1-300 (each column named for its keyword)
180,000 rows
Second, I have another dataframe with 1 row for each keyword (the same keywords that make up the columns of DF_Abstracts), and each column in this dataframe is a "group name" that uses a selection of keywords to define the group. There are 23 groups defined (each group actually has a freeform name like "energy", "power", or "solar"). A 1 or 0 is listed in each row under the group names to indicate whether that keyword is part of the group.
DF_Groups:
keywords    Group1  Group2  Group3  ....  Group23
3D Print    1       0       0             1
Adv Matl    0       1       1             0
300 rows
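For reference, toy versions of these two tables (with made-up keywords, groups and values, just to make the structure concrete) could be built like this:
# Toy DF_Abstracts: title, abstract text, plus one TRUE/FALSE column per keyword
toy_abstracts <- data.frame(
  Title = c("Paper A", "Paper B"),
  Abstract = c("... solar cells ...", "... 3D printed parts ..."),
  `3D Print` = c(FALSE, TRUE),
  `Adv Matl` = c(TRUE, FALSE),
  check.names = FALSE
)
# Toy DF_Groups: one row per keyword, one 1/0 column per group
toy_groups <- data.frame(
  keywords = c("3D Print", "Adv Matl"),
  Group1 = c(1, 0),
  Group2 = c(0, 1),
  Group23 = c(1, 0)
)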
What I would like to do is use these 2 data frames to determine which group(s) each abstract belongs to. Each abstract could be a member of more than 1 group. I would like to replace the table of keywords attached to each abstract with a table indicating to which of the 23 groups the abstract could belong.
So ultimately I would like to see something like this:
DF_Abstracts:
Abstract Group1 ….. Group23
text T F
180,000 rows
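In other words, a toy version of the target (values invented purely for illustration) would be:
# Desired result: abstract text plus one logical column per group
data.frame(
  Abstract = c("... solar cells ...", "... 3D printed parts ..."),
  Group1 = c(FALSE, TRUE),
  Group23 = c(TRUE, FALSE)
)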
I created some code with nested FOR loops to take the row of keyword TRUE/FALSE values, transpose it, then check it against each of the 23 groups ... and it works, but it takes more than an hour to process all 180,000 abstracts.
Here is the code:
# findGROUP returns the number of matches when the transposed keyword row i
# is multiplied element-wise against group column x of DF_Groups
findGROUP <- function(i, x) {
  sum(t(DF_Abstracts[i, 1:NCOL(DF_Abstracts)]) * DF_Groups[, x], na.rm = TRUE)
}

out <- NULL
tmp <- logical(NCOL(DF_Groups))
for (i in 1:180000) {                    # ultimately NROW(DF_Abstracts)
  for (x in 2:NCOL(DF_Groups)) {         # columns 2 onward are the group columns
    tmp[x] <- findGROUP(i, x) > 0
  }
  out <- rbind(out, tmp)                 # grow the result one row at a time (slow)
}
There must be a simpler, faster way to do this, no?
Here is some actual data from dput() on the two data tables (the full dput output is too large to include here):
dput(DF_Abstracts[1,])
structure(list(`3D Printing` = FALSE, `3D Technology` = FALSE,
`A/B Testing` = FALSE, `Advanced Materials` = FALSE, Aerospace = FALSE,
`Air Transportation` = FALSE, Analytics = FALSE, Android = FALSE,
`Angel Investment` = FALSE, Animation = FALSE, `App Discovery` = FALSE,
`App Marketing` = FALSE, `Application Performance Management` = FALSE,
`Application Specific Integrated Circuit (ASIC)` = FALSE,
Apps = FALSE, `Archiving Service` = FALSE, `Artificial Intelligence` = FALSE,
`Asset Management` = FALSE, `Assistive Technology` = FALSE,
Audio = FALSE, `Augmented Reality` = FALSE, Automotive = FALSE,
`Autonomous Vehicles` = FALSE, B2B = FALSE, Battery = FALSE,
`Big Data` = FALSE, Biofuel = FALSE, Bioinformatics = FALSE,
`Biomass Energy` = FALSE, Biometrics = FALSE, Biopharma = FALSE,
Biotechnology = FALSE, Blockchain = FALSE, `Business Development` = FALSE,
`Business Information Systems` = FALSE, `Business Intelligence` = FALSE,
CAD = FALSE, Chemical = FALSE, `Chemical Engineering` = FALSE,
CivicTech = FALSE, `Civil Engineering` = FALSE, `Clean Energy` = FALSE,
CleanTech = FALSE, `Cloud Computing` = FALSE, `Cloud Data Services` = FALSE,
`Cloud Infrastructure` = FALSE, `Cloud Management` = FALSE,
`Cloud Security` = FALSE, `Cloud Storage` = FALSE), row.names = 1L, class = "data.frame")
dput(DF_Groups[1,])
structure(list(name = "3D Printing", `LMV Interest` = 1, `NOT LMV` = 0,
`AI-Autonomy` = 0, Quantum = 0, `Cyber & Security` = 0, `Digital TX` = 1,
`5G.mil-Wireless Comms` = 0, `Electric Propulsion` = 0, Biotech = 0,
`Data & Computing` = 0, `Directed Energy` = 1, `Energy & Power` = 0,
`Human Potential` = 0, Materials = 0, `NextGen Electronics` = 0,
Sensors = 0, `Aerospace Tech` = 0, `AR-VR` = 0, `Climate-Related` = 0,
`Late Stage` = 1, `Semiconductor-IC` = 0, Logistics = 0), row.names = 1L, class = "data.frame")
2 Answers
Still slow, but simpler than my previous attempt. I would still like to see something that runs faster ... it takes more than 20 minutes to run this on all the data.
Let’s say you have n abstracts, k keywords and g groups.
You should convert your data so that you have an abstract matrix of size n x k and a group matrix of size k x g. That way you can use matrix multiplication to do your calculations. In order to do so, you should make sure the keyword columns of the abstract matrix and the keyword rows of the group matrix are in the same order.
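Applied to the data frames in the question, that conversion might look roughly like this (only a sketch: it assumes the keyword columns of DF_Abstracts carry exactly the names listed in the name column of DF_Groups, as in the dput output above, and that any non-keyword columns such as Title and Abstract are left out):
# Keyword names, taken from DF_Groups so both matrices use the same order
keywords <- DF_Groups$name

# n x k matrix of 0/1: which keywords appear in each abstract
abstract_mat <- as.matrix(DF_Abstracts[, keywords]) * 1L

# k x g matrix of 0/1: which keywords define each group
group_mat <- as.matrix(DF_Groups[, -1])
rownames(group_mat) <- keywords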
Here is a simplified example with 3 abstracts, 4 keywords and 5 groups:
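A minimal sketch of such an example might look like this (the matrix names abstract_mat and group_mat match the expressions discussed below; the values themselves are made up):
# 3 abstracts x 4 keywords: 1 if the keyword occurs in the abstract
abstract_mat <- matrix(c(1, 0, 1, 0,
                         0, 1, 1, 1,
                         0, 0, 0, 1),
                       nrow = 3, byrow = TRUE,
                       dimnames = list(paste0("abstract", 1:3), paste0("kw", 1:4)))

# 4 keywords x 5 groups: 1 if the keyword belongs to the group
group_mat <- matrix(c(1, 0, 0, 1, 0,
                      0, 1, 0, 1, 0,
                      1, 0, 0, 0, 1,
                      0, 1, 1, 0, 1),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(paste0("kw", 1:4), paste0("group", 1:5)))

# Number of keywords from each group found in each abstract (groups as rows)
t(abstract_mat %*% group_mat)

# Total number of keywords defining each group
colSums(group_mat)

# 0 means the abstract contains every keyword of that group
colSums(group_mat) - t(abstract_mat %*% group_mat)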
The t(abstract_mat %*% group_mat) call returns the number of keywords from each group found in each abstract, and colSums(group_mat) gives you the total number of keywords present in each group. By taking their difference, you can find out whether an abstract contains all the keywords from a group (this is the case when the difference equals 0).
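Putting this together on the full data (using the abstract_mat and group_mat built from DF_Abstracts and DF_Groups above), the membership table asked for in the question could then be produced in one vectorised step. Here membership is taken to mean "at least one matching keyword", mirroring the findGROUP(i, x) > 0 test in the original loop; use the all-keywords comparison above instead if that is the intended definition. The Abstract column is assumed to exist in DF_Abstracts, as in the layout described at the top of the question.
# n x g matrix: how many keywords of each group occur in each abstract
counts <- abstract_mat %*% group_mat

# TRUE/FALSE membership: at least one keyword of the group is present
membership <- counts > 0

# Final table: abstract text (assumed column) plus one logical column per group
DF_result <- data.frame(Abstract = DF_Abstracts$Abstract, membership,
                        check.names = FALSE)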