I´ve got a data frame with a column consisting of Strings (project_skills) which denotes the skills a certain job (job_id) affords. I want to split this string for every job to get an vector of the skills a job affords and then create a Document Term Matrix to denote which skills (out of every possible skill) a certain job affords.
I´ve got the following data frame:
job_id project_skills
107182 CSS,HTML,Joomla,PHP
108169 XTCommerce,Magento,Prestashop,VirtueMart,osCommerce
112969 Google Search Console,Google Analytics,Google Webmaster Central,C++,Java,C#
114660 Marketing,Email Marketing
118686 PHP
The result should look anything like this (which is basically a Document Term Matrix with comma separated phrases:
project_skills
job_id CSS HTML PHP Google Search Console Google Analytics Java ...
107182 1 0 0 ...
108169 0 0 0 0 0
112969 0 0 0 1 1 ...
114660 0 0 0 ...
118686 0 0 1 ...
I´ve tried the following:
df <- data.frame(job_id = c(107182, 108169, 112969, 114660, 118686), project_skills = c("CSS,HTML,Joomla,PHP", "XTCommerce,Magento,Prestashop,VirtueMart,osCommerce", "Google Search Console,Google Analytics,Google Webmaster Central,C++,Java,C#", "Marketing,Email Marketing", "PHP"))
corpus <- Corpus(VectorSource(df$project_skills))
corpus <- tm_map(corpus, function(x) {
PlainTextDocument(
strsplit(x,"\,")[[1]],
id=ID(x)
)
})
inspect(corpus)
dtm <- DocumentTermMatrix(corpus)
as.matrix(dtm)
But unfortunately this splits all words and not at the commas (e.g. Google Search Console should be treated as one Term in the DTM).
2
Answers
tm (or some other text mining packages) split on words (spaces) and if you don’t check tend to remove punctuation like + and #. The simplest option is just to use
strsplit
. I show an option below using tidyr and dplyr. First group by job_id, next split the columns. This will create a nesting, which when unnested creates a long data.frame. Here I add the value 1 for every entry which acts like a 1 in a document term matrix. Then spread into a wide format to get your expected output. If you look at the resulting structure the column names are what you expect, not showing the tilde (~).There are many solutions to this but strsplit is your friend. That’s exactly what is done in the following code: