
First, I have a large dataframe of abstracts (180,000 rows) that I have already checked against a list of more than 300 keywords. For each keyword I added a column holding T or F, depending on whether the keyword is present in the abstract. So the dataframe now consists of the abstract (text) plus 300+ columns, one per keyword, each holding T-F.

DF_Abstracts:

Title, Abstract, keyword columns 1-300 (each column named for its keyword)

180,000 rows

Second, I have another dataframe with one row per keyword (the same keywords that make up the columns of DF_Abstracts). Each column in this dataframe is a "group name" that uses a selection of keywords to define the group. There are 23 groups defined (each group actually has a freeform name like "energy" or "power" or "solar"). A 1 or 0 in each row under a group name indicates whether that keyword is part of the group.

DF_Groups:

keywords   Group1  Group2  Group3  ...  Group23
3D Print   1       0       0       ...  1
Adv Matl   0       1       1       ...  0

300 rows

What I would like to do is use these 2 data frames to determine to which groups each abstract belongs. Each abstract could be a member of more than 1 group. I would like to replace the table of keywords attached to each abstract with a table indicating to which of the 23 groups the abstract could belong.

So ultimately would like to see something like this:

DF_Abstracts:

Abstract  Group1  ...  Group23
text      T       ...  F

180,000 rows

I wrote some code with nested for loops that takes each row of keyword flags (T-F), transposes it, and checks it against each of the 23 groups. It works, but it will take more than an hour to process all 180,000 abstracts.

Here is the code:

findGROUP <- function(i, x) {sum(t(DF_Abstracts[i, 1:NCOL(DF_Abstracts)]) * DF_Groups[, x], na.rm = TRUE)}

This function returns the number of matches when t(keyword row) is multiplied against the "group" column.

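out <- NULL # one row of group flags is appended per abstract
for (i in 1:180000) { # ultimately NROW(DF_Abstracts)
  tmp <- logical(23) # tmp[1] stays unused: column 1 of DF_Groups holds the keyword name
  for (x in 2:23) {
    tmp[x] <- findGROUP(i, x) > 0
  }
  out <- rbind(out, tmp)
}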

There must be a simpler, faster way to do this, no?

Here is some actual data using dput on the 2 data tables (the dput output is too large to put all here though):

dput(DF_Abstracts[1,])
structure(list(`3D Printing` = FALSE, `3D Technology` = FALSE, 
    `A/B Testing` = FALSE, `Advanced Materials` = FALSE, Aerospace = FALSE, 
    `Air Transportation` = FALSE, Analytics = FALSE, Android = FALSE, 
    `Angel Investment` = FALSE, Animation = FALSE, `App Discovery` = FALSE, 
    `App Marketing` = FALSE, `Application Performance Management` = FALSE, 
    `Application Specific Integrated Circuit (ASIC)` = FALSE, 
    Apps = FALSE, `Archiving Service` = FALSE, `Artificial Intelligence` = FALSE, 
    `Asset Management` = FALSE, `Assistive Technology` = FALSE, 
    Audio = FALSE, `Augmented Reality` = FALSE, Automotive = FALSE, 
    `Autonomous Vehicles` = FALSE, B2B = FALSE, Battery = FALSE, 
    `Big Data` = FALSE, Biofuel = FALSE, Bioinformatics = FALSE, 
    `Biomass Energy` = FALSE, Biometrics = FALSE, Biopharma = FALSE, 
    Biotechnology = FALSE, Blockchain = FALSE, `Business Development` = FALSE, 
    `Business Information Systems` = FALSE, `Business Intelligence` = FALSE, 
    CAD = FALSE, Chemical = FALSE, `Chemical Engineering` = FALSE, 
    CivicTech = FALSE, `Civil Engineering` = FALSE, `Clean Energy` = FALSE, 
    CleanTech = FALSE, `Cloud Computing` = FALSE, `Cloud Data Services` = FALSE, 
    `Cloud Infrastructure` = FALSE, `Cloud Management` = FALSE, 
    `Cloud Security` = FALSE, `Cloud Storage` = FALSE), row.names = 1L, class = "data.frame")

 dput(DF_Groups[1,])
structure(list(name = "3D Printing", `LMV Interest` = 1, `NOT LMV` = 0, 
    `AI-Autonomy` = 0, Quantum = 0, `Cyber & Security` = 0, `Digital TX` = 1, 
    `5G.mil-Wireless Comms` = 0, `Electric Propulsion` = 0, Biotech = 0, 
    `Data & Computing` = 0, `Directed Energy` = 1, `Energy & Power` = 0, 
    `Human Potential` = 0, Materials = 0, `NextGen Electronics` = 0, 
    Sensors = 0, `Aerospace Tech` = 0, `AR-VR` = 0, `Climate-Related` = 0, 
    `Late Stage` = 1, `Semiconductor-IC` = 0, Logistics = 0), row.names = 1L, class = "data.frame")

2 Answers


  1. Chosen as BEST ANSWER

    Still slow, but simpler than my previous attempt. I would still like to see something that runs faster: this takes more than 20 minutes to run on all the data.

    findLNSCP <- function(i, x) {
      # number of keywords present in abstract i that belong to the group in column x
      sum(t(DF_Abstracts[i, 1:NCOL(DF_Abstracts)]) * DF_Groups[, x], na.rm = TRUE)
    }

    out_group <- matrix(FALSE, nrow = NROW(DF_Abstracts), ncol = NCOL(DF_Groups) - 1)
    colnames(out_group) <- colnames(DF_Groups)[2:NCOL(DF_Groups)]

    for (i in 2:NCOL(DF_Groups)) { # check each group definition against each abstract
      tmp <- sapply(1:NROW(DF_Abstracts), findLNSCP, i) > 0 # vector of T-F for group i
      out_group[, i - 1] <- tmp
    }
    

  2. Let’s say you have n abstracts, k keywords and g groups.

    You should convert your data so that you have an abstract matrix of size n x k and a group matrix of size k x g. That way you can use matrix multiplication to do your calculations.
    In order to do so, you should make sure to:

    • reorder your group data frame so that the keywords are in the same order as in your abstract data frame
    • remove the abstract text column(s) from the abstract data frame and the keyword name column from the group data frame
    • convert your data frames to 1/0 matrices
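
    The three preparation steps above can be sketched as follows. This is only a sketch on toy data: the two small data frames are hypothetical stand-ins for DF_Abstracts and DF_Groups, and it assumes that the only non-keyword column in the abstract data frame is called Abstract and that the keyword names sit in a column called name (as in the dput output in the question). Adjust those names to your real data.

    ```r
    # Toy stand-ins for DF_Abstracts and DF_Groups (hypothetical data:
    # 3 abstracts, 2 keywords, 2 groups).
    DF_Abstracts <- data.frame(
      Abstract = c("text a", "text b", "text c"),
      `3D Printing` = c(TRUE, FALSE, FALSE),
      `Advanced Materials` = c(FALSE, TRUE, FALSE),
      check.names = FALSE
    )
    DF_Groups <- data.frame(
      name = c("Advanced Materials", "3D Printing"),  # deliberately in another order
      Group1 = c(0, 1),
      Group2 = c(1, 1)
    )

    # Keyword columns = everything except the text column (assumed name)
    keywords <- setdiff(names(DF_Abstracts), "Abstract")

    # Reorder the group rows so they follow the keyword column order
    idx <- match(keywords, DF_Groups$name)
    stopifnot(!anyNA(idx))  # every keyword needs a row in DF_Groups

    # n x k matrix of 0/1 (multiplying a logical matrix by 1 makes it numeric)
    abstract_mat <- as.matrix(DF_Abstracts[, keywords, drop = FALSE]) * 1

    # k x g matrix of 0/1, with the keyword name column dropped
    group_mat <- as.matrix(DF_Groups[idx, -1, drop = FALSE])
    rownames(group_mat) <- keywords
    ```

    After this, the columns of abstract_mat and the rows of group_mat refer to the same keywords in the same order, which is what the matrix product needs.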

    Here is a simplified example with 3 abstracts, 4 keywords and 5 groups:

    abstract_mat = matrix(c(1,0,0,1,1,0,0,0,1,0,0,0), ncol = 4)
    colnames(abstract_mat) = paste0("keyword",c(1:4))
    
    abstract_mat
            keyword1 keyword2 keyword3 keyword4
    [1,]        1        1        0        0
    [2,]        0        1        0        0
    [3,]        0        0        1        0
    
    group_mat = matrix(c(1,0,0,0,1,1,0,0,0,0,1,0,1,1,1,1,0,1,0,0), ncol = 5)
    colnames(group_mat) = paste0("group",c(1:5))
    rownames(group_mat) = paste0("keyword",c(1:4))
    
    group_mat
                 group1 group2 group3 group4 group5
    keyword1      1      1      0      1      0
    keyword2      0      1      0      1      1
    keyword3      0      0      1      1      0
    keyword4      0      0      0      1      0
    
    res = (t(t(abstract_mat %*% group_mat) - colSums(group_mat)) == 0)
    res
         group1 group2 group3 group4 group5
    [1,]   TRUE   TRUE  FALSE  FALSE   TRUE
    [2,]  FALSE  FALSE  FALSE  FALSE   TRUE
    [3,]  FALSE  FALSE   TRUE  FALSE  FALSE
    

    The product abstract_mat %*% group_mat gives, for each abstract (row) and each group (column), the number of that group's keywords found in the abstract, and colSums(group_mat) gives the total number of keywords in each group. The two t() calls are only there so that the subtraction lines up each entry of colSums(group_mat) with its group. Where the difference equals 0, the abstract contains all the keywords of that group.
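
    Note that the code in the question tests for a count > 0, i.e. it treats an abstract as a member of a group as soon as at least one of the group's keywords is present. If that is the rule you want, the same matrix product can be compared against 0 instead. A minimal sketch on the example matrices above (res_any and counts are names made up for this illustration):

    ```r
    abstract_mat = matrix(c(1,0,0,1,1,0,0,0,1,0,0,0), ncol = 4)
    group_mat = matrix(c(1,0,0,0,1,1,0,0,0,0,1,0,1,1,1,1,0,1,0,0), ncol = 5)
    colnames(group_mat) = paste0("group", c(1:5))

    # counts[a, g] = number of keywords of group g present in abstract a
    counts = abstract_mat %*% group_mat

    # TRUE where the abstract contains at least one keyword of the group
    res_any = counts > 0
    res_any
    ```

    With this rule, group4 (which uses every keyword) comes out TRUE for all three abstracts, whereas the "all keywords" rule above gives FALSE for it. Either way there is a single matrix multiplication and no loop over the abstracts.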
