I have long lists of strings such as this machine readable example:
A <- list(c("Biology","Cell Biology","Art","Humanities, Multidisciplinary; Psychology, Experimental","Astronomy & Astrophysics; Physics, Particles & Fields","Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods","Geriatrics & Gerontology","Gerontology","Management","Operations Research & Management Science","Computer Science, Artificial Intelligence; Computer Science, Information Systems; Engineering, Electrical & Electronic","Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods; Statistics & Probability"))
So it looks like this:
> A
[[1]]
[1] "Biology"
[2] "Cell Biology"
[3] "Art"
[4] "Humanities, Multidisciplinary; Psychology, Experimental"
[5] "Astronomy & Astrophysics; Physics, Particles & Fields"
[6] "Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods"
[7] "Geriatrics & Gerontology"
[8] "Gerontology"
[9] "Management"
[10] "Operations Research & Management Science"
[11] "Computer Science, Artificial Intelligence; Computer Science, Information Systems; Engineering, Electrical & Electronic"
[12] "Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods; Statistics & Probability"
I would like to edit these terms and eliminate duplicates in order to get this result:
[1] "Science"
[2] "Science"
[3] "Arts & Humanities"
[4] "Arts & Humanities; Social Sciences"
[5] "Science"
[6] "Social Sciences; Science"
[7] "Science"
[8] "Social Sciences"
[9] "Social Sciences"
[10] "Science"
[11] "Science"
[12] "Social Sciences; Science"
So far I only got this:
stringedit <- function(A)
{
A <-gsub("Biology", "Science", A)
A <-gsub("Cell Biology", "Science", A)
A <-gsub("Art", "Arts & Humanities", A)
A <-gsub("Humanities, Multidisciplinary", "Arts & Humanities", A)
A <-gsub("Psychology, Experimental", "Social Sciences", A)
A <-gsub("Astronomy & Astrophysics", "Science", A)
A <-gsub("Physics, Particles & Fields", "Science", A)
A <-gsub("Economics", "Social Sciences", A)
A <-gsub("Mathematics", "Science", A)
A <-gsub("Mathematics, Applied", "Science", A)
A <-gsub("Mathematics, Interdisciplinary Applications", "Science", A)
A <-gsub("Social Sciences, Mathematical Methods", "Social Sciences", A)
A <-gsub("Geriatrics & Gerontology", "Science", A)
A <-gsub("Gerontology", "Social Sciences", A)
A <-gsub("Management", "Social Sciences", A)
A <-gsub("Operations Research & Management Science", "Science", A)
A <-gsub("Computer Science, Artificial Intelligence", "Science", A)
A <-gsub("Computer Science, Information Systems", "Science", A)
A <-gsub("Engineering, Electrical & Electronic", "Science", A)
A <-gsub("Statistics & Probability", "Science", A)
}
B <- lapply(A, stringedit)
But it does not work properly:
> B
[[1]]
[1] "Science"
[2] "Cell Science"
[3] "Arts & Humanities"
[4] "Arts & Humanities; Social Sciences"
[5] "Science; Science"
[6] "Social Sciences; Science, Interdisciplinary Applications; Social Sciences"
[7] "Science"
[8] "Social Sciences"
[9] "Social Sciences"
[10] "Operations Research & Social Sciences Science"
[11] "Computer Science, Arts & Humanitiesificial Intelligence; Science; Science"
[12] "Social Sciences; Science, Interdisciplinary Applications; Social Sciences; Science"
How can I achieve the correct output mentioned above?
Thank you very much in advance for your consideration!
3 Answers
Let me start with one example. You have the string “Cell Biology”. The first substitution, A <- gsub("Biology", "Science", A), turns it into “Cell Science”, which is then never substituted, because the later pattern “Cell Biology” no longer matches.

Since you do not use regular expressions, I would rather use a kind of hash to do the substitutions:
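The answer's original code is not preserved here. A minimal sketch of such a “hash”, assuming a named character vector built from the mappings in the question (the name hash is illustrative):

```r
# Named character vector used as a lookup table (assumed structure):
# names are the original terms, values are the target categories.
hash <- c(
  "Biology"                                     = "Science",
  "Cell Biology"                                = "Science",
  "Art"                                         = "Arts & Humanities",
  "Humanities, Multidisciplinary"               = "Arts & Humanities",
  "Psychology, Experimental"                    = "Social Sciences",
  "Astronomy & Astrophysics"                    = "Science",
  "Physics, Particles & Fields"                 = "Science",
  "Economics"                                   = "Social Sciences",
  "Mathematics"                                 = "Science",
  "Mathematics, Applied"                        = "Science",
  "Mathematics, Interdisciplinary Applications" = "Science",
  "Social Sciences, Mathematical Methods"       = "Social Sciences",
  "Geriatrics & Gerontology"                    = "Science",
  "Gerontology"                                 = "Social Sciences",
  "Management"                                  = "Social Sciences",
  "Operations Research & Management Science"    = "Science",
  "Computer Science, Artificial Intelligence"   = "Science",
  "Computer Science, Information Systems"       = "Science",
  "Engineering, Electrical & Electronic"        = "Science",
  "Statistics & Probability"                    = "Science"
)
```

Because the whole term is the key, “Cell Biology” and “Biology” are separate entries and cannot interfere with each other the way successive gsub calls do.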
Now, given a string such as “Biology”, you can quickly look up your category:
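A short sketch of that lookup, assuming the hash is a named character vector (only three illustrative entries are repeated here to keep the example self-contained):

```r
# Illustrative subset of the lookup table; names are terms, values are categories.
hash <- c("Biology" = "Science",
          "Cell Biology" = "Science",
          "Art" = "Arts & Humanities")

# Single lookup by exact name.
one <- hash[["Cell Biology"]]

# Vectorised lookup of several terms at once; unname() drops the term names.
several <- unname(hash[c("Art", "Biology")])

# A term that is not in the table yields NA rather than an error.
missing_term <- hash["Physics"]
```

Indexing a vector by name with [ matches names exactly, so “Biology” never matches “Cell Biology”.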
I am not sure why you want to use a list instead of a vector of strings, so I will simplify your case a bit:
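One way to do that simplification, shown on a shortened version of the question's data (the three strings are taken from the question; the rest are omitted for brevity):

```r
# The question wraps a single character vector in a list.
A <- list(c("Biology",
            "Cell Biology",
            "Humanities, Multidisciplinary; Psychology, Experimental"))

# Flatten it into a plain character vector.
A <- unlist(A)
```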
The hash lookup will not work for the composite strings (containing “;”). You can, however, split them using strsplit. Then you can use unique to avoid term repetition, and put everything back together using the paste function.

Here is the result, as desired:
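The original code and its printed output are missing from this copy. The following is a reconstruction of the steps described above, not the answerer's exact code: the names categories, hash and recode are assumptions, and the lookup table is built here from a category-to-terms list for compactness.

```r
# Category -> terms, taken from the mappings in the question.
categories <- list(
  "Science" = c("Biology", "Cell Biology", "Astronomy & Astrophysics",
                "Physics, Particles & Fields", "Mathematics",
                "Mathematics, Applied",
                "Mathematics, Interdisciplinary Applications",
                "Geriatrics & Gerontology",
                "Operations Research & Management Science",
                "Computer Science, Artificial Intelligence",
                "Computer Science, Information Systems",
                "Engineering, Electrical & Electronic",
                "Statistics & Probability"),
  "Social Sciences" = c("Psychology, Experimental", "Economics",
                        "Social Sciences, Mathematical Methods",
                        "Gerontology", "Management"),
  "Arts & Humanities" = c("Art", "Humanities, Multidisciplinary")
)

# Invert into a term -> category named vector (the "hash").
hash <- setNames(rep(names(categories), lengths(categories)),
                 unlist(categories, use.names = FALSE))

# The question's data as a plain character vector.
A <- c("Biology", "Cell Biology", "Art",
       "Humanities, Multidisciplinary; Psychology, Experimental",
       "Astronomy & Astrophysics; Physics, Particles & Fields",
       "Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods",
       "Geriatrics & Gerontology", "Gerontology", "Management",
       "Operations Research & Management Science",
       "Computer Science, Artificial Intelligence; Computer Science, Information Systems; Engineering, Electrical & Electronic",
       "Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods; Statistics & Probability")

# Split one string on "; ", look each term up, drop repeats, glue back together.
# Terms missing from the table would appear as "NA" in the output.
recode <- function(x) {
  terms <- strsplit(x, "; ", fixed = TRUE)[[1]]
  paste(unique(unname(hash[terms])), collapse = "; ")
}

result <- vapply(A, recode, character(1), USE.NAMES = FALSE)
```

With the data above, result should contain the twelve strings listed as the desired output in the question, in the same order.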
Of course, you can also call the *apply functions several times, one per step. This is neither more nor less efficient than the previous code.
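A sketch of that step-by-step variant, with one *apply call per stage. The small hash and A below are illustrative subsets of the question's data, and the intermediate names terms and cats are assumptions:

```r
# Illustrative subset of the lookup table and of the input strings.
hash <- c("Cell Biology" = "Science",
          "Astronomy & Astrophysics" = "Science",
          "Physics, Particles & Fields" = "Science",
          "Economics" = "Social Sciences",
          "Mathematics, Interdisciplinary Applications" = "Science")

A <- c("Cell Biology",
       "Astronomy & Astrophysics; Physics, Particles & Fields",
       "Economics; Mathematics, Interdisciplinary Applications")

terms  <- strsplit(A, "; ", fixed = TRUE)              # list of term vectors
cats   <- lapply(terms, function(x) unname(hash[x]))   # terms -> categories
cats   <- lapply(cats, unique)                         # drop repeated categories
result <- sapply(cats, paste, collapse = "; ")         # back to one string each
```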
Yes, you could achieve the same with regular expressions, but first, it would involve splitting the strings anyway, and then using regexes like ^Biology$ to make sure that they match “Biology” but not “Cell Biology” etc., unless you want to go for constructs like “.* Biology”. Finally, you would have to get rid of the duplicates anyway, and all in all it would be, in my opinion, (i) less verbose (= more error prone) and (ii) not worth the effort.

And how about using switch?
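The code for this answer is not preserved, so the following is only a sketch of how switch could be used. The helper names to_category and recode are hypothetical, only a few of the question's terms are listed, and the NA default is an assumption:

```r
# Map a single term to its category. Empty alternatives ("Biology" = ,)
# fall through to the next non-empty value.
to_category <- function(term) {
  switch(term,
    "Biology" = ,
    "Cell Biology" = ,
    "Mathematics, Interdisciplinary Applications" = "Science",
    "Economics" = ,
    "Gerontology" = "Social Sciences",
    "Art" = ,
    "Humanities, Multidisciplinary" = "Arts & Humanities",
    NA_character_  # default when nothing matches (assumed behaviour)
  )
}

# Split a composite string, map each term, drop repeats, and rejoin.
recode <- function(x) {
  terms <- strsplit(x, "; ", fixed = TRUE)[[1]]
  cats  <- vapply(terms, to_category, character(1), USE.NAMES = FALSE)
  paste(unique(cats), collapse = "; ")
}

recode("Economics; Mathematics, Interdisciplinary Applications")
```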
Of course, this will work as long as you have proper strings passed in as switch arguments. One mismatch, and you’ll end up with NA. For more advanced usage, you should write your own wrapper around the grep family of functions, or even agrep (handle with care).

I found it easiest to have a two-column data.frame as a lookup, with one column for the course name and one column for the category. Here’s an example:

Then, assuming A is a list as in your question:

match matches the values from A with the course names in the course.categories dataset and says which rows the matches occur on; this is used to extract the category each course belongs to. Then, unique makes sure we have just one of each category, and paste puts things back together.
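Neither code block from this answer survives in this copy. A sketch of the approach as described, where course.categories is the name used above, and the column names course and category, the shortened data, and the result name B are assumptions:

```r
# Two-column lookup table: one row per course name (subset of the question's terms).
course.categories <- data.frame(
  course   = c("Biology", "Cell Biology", "Art",
               "Humanities, Multidisciplinary", "Psychology, Experimental",
               "Economics", "Mathematics, Interdisciplinary Applications",
               "Social Sciences, Mathematical Methods"),
  category = c("Science", "Science", "Arts & Humanities",
               "Arts & Humanities", "Social Sciences",
               "Social Sciences", "Science", "Social Sciences"),
  stringsAsFactors = FALSE
)

# A list holding one character vector, as in the question (shortened).
A <- list(c("Cell Biology", "Art",
            "Humanities, Multidisciplinary; Psychology, Experimental",
            "Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods"))

B <- lapply(A, function(a) {
  sapply(strsplit(a, "; ", fixed = TRUE), function(terms) {
    rows <- match(terms, course.categories$course)   # row of each course name
    paste(unique(course.categories$category[rows]),  # one of each category
          collapse = "; ")
  })
})
```

A data.frame lookup like this is easy to keep in a CSV file and extend without touching the code.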