Proper use of gsub / regular expressions in R? - Artificial Intelligence

user1496104
October 22, 2012
126 views
3 votes
3 Answers

I have long lists of strings such as this machine readable example:

A <- list(c("Biology","Cell Biology","Art","Humanities, Multidisciplinary; Psychology, Experimental","Astronomy & Astrophysics; Physics, Particles & Fields","Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods","Geriatrics & Gerontology","Gerontology","Management","Operations Research & Management Science","Computer Science, Artificial Intelligence; Computer Science, Information Systems; Engineering, Electrical & Electronic","Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods; Statistics & Probability"))

So it looks like this:

> A  
[[1]]  
 [1] "Biology"  
 [2] "Cell Biology"  
 [3] "Art"  
 [4] "Humanities, Multidisciplinary; Psychology, Experimental"  
 [5] "Astronomy & Astrophysics; Physics, Particles & Fields"  
 [6] "Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods"  
 [7] "Geriatrics & Gerontology"  
 [8] "Gerontology"  
 [9] "Management"  
[10] "Operations Research & Management Science"  
[11] "Computer Science, Artificial Intelligence; Computer Science, Information Systems; Engineering, Electrical & Electronic"  
[12] "Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods; Statistics & Probability"

I would like to edit these terms and eliminate duplicates in order to get this result:

 [1] "Science"  
 [2] "Science"  
 [3] "Arts & Humanities"  
 [4] "Arts & Humanities; Social Sciences"  
 [5] "Science"  
 [6] "Social Sciences; Science"  
 [7] "Science"  
 [8] "Social Sciences"  
 [9] "Social Sciences"  
[10] "Science"  
[11] "Science"  
[12] "Social Sciences; Science"

So far I only got this:

stringedit <- function(A)  
{  
  A <-gsub("Biology", "Science", A)  
  A <-gsub("Cell Biology", "Science", A)  
  A <-gsub("Art", "Arts & Humanities", A)  
  A <-gsub("Humanities, Multidisciplinary", "Arts & Humanities", A)  
  A <-gsub("Psychology, Experimental", "Social Sciences", A)  
  A <-gsub("Astronomy & Astrophysics", "Science", A)  
  A <-gsub("Physics, Particles & Fields", "Science", A)  
  A <-gsub("Economics", "Social Sciences", A)  
  A <-gsub("Mathematics", "Science", A)  
  A <-gsub("Mathematics, Applied", "Science", A)  
  A <-gsub("Mathematics, Interdisciplinary Applications", "Science", A)  
  A <-gsub("Social Sciences, Mathematical Methods", "Social Sciences", A)  
  A <-gsub("Geriatrics & Gerontology", "Science", A)  
  A <-gsub("Gerontology", "Social Sciences", A)  
  A <-gsub("Management", "Social Sciences", A)  
  A <-gsub("Operations Research & Management Science", "Science", A)  
  A <-gsub("Computer Science, Artificial Intelligence", "Science", A)  
  A <-gsub("Computer Science, Information Systems", "Science", A)  
  A <-gsub("Engineering, Electrical & Electronic", "Science", A)  
  A <-gsub("Statistics & Probability", "Science", A)  
}  
B <- lapply(A, stringedit)

But it does not work properly:

> B  
[[1]]  
 [1] "Science"  
 [2] "Cell Science"  
 [3] "Arts & Humanities"  
 [4] "Arts & Humanities; Social Sciences"  
 [5] "Science; Science"  
 [6] "Social Sciences; Science, Interdisciplinary Applications; Social Sciences"  
 [7] "Science"  
 [8] "Social Sciences"  
 [9] "Social Sciences"  
[10] "Operations Research & Social Sciences Science"  
[11] "Computer Science, Arts & Humanitiesificial Intelligence; Science; Science"  
[12] "Social Sciences; Science, Interdisciplinary Applications; Social Sciences; Science"

How can I achieve the correct output mentioned above?
Thank you very much in advance for your consideration!

Tags: gsub list r regex

Answers

Let me start with one example. You have the string “Cell Biology”. The first substitution, A <- gsub("Biology", "Science", A), turns it into “Cell Science”, which the later pattern "Cell Biology" no longer matches, so it is never replaced.
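To make the failure concrete, here is a minimal sketch that replays three of the substitutions from the question on a shortened input (the variable names x and step1 to step3 are only for illustration):

```r
# gsub() replaces every occurrence of the pattern, even inside a longer term.
x <- c("Biology", "Cell Biology", "Computer Science, Artificial Intelligence")

# "Biology" also matches inside "Cell Biology".
step1 <- gsub("Biology", "Science", x)

# "Cell Biology" has already become "Cell Science", so nothing matches here.
step2 <- gsub("Cell Biology", "Science", step1)

# "Art" also matches the first three letters of "Artificial".
step3 <- gsub("Art", "Arts & Humanities", step2)

print(step3)
```

The second and third elements come out mangled in the same way as items [2] and [11] of your output B.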

Since your patterns are plain strings rather than real regular expressions, I would rather use a kind of hash (a named vector) to do the substitutions:

myhash <- c( "Science", "Science", "Arts & Humanities", "Arts & Humanities", "Social Sciences", 
  "Science", "Science", "Social Sciences", "Science", "Science", "Science", "Social Sciences", 
  "Science", "Social Sciences", "Social Sciences", "Science", "Science", "Science", "Science", 
  "Science" )

names( myhash ) <- c( "Biology", "Cell Biology", "Art", "Humanities, Multidisciplinary", 
  "Psychology, Experimental", "Astronomy & Astrophysics", "Physics, Particles & Fields", "Economics", 
  "Mathematics", "Mathematics, Applied", "Mathematics, Interdisciplinary Applications", 
  "Social Sciences, Mathematical Methods", "Geriatrics & Gerontology", "Gerontology", "Management",
   "Operations Research & Management Science", "Computer Science, Artificial Intelligence", 
  "Computer Science, Information Systems", "Engineering, Electrical & Electronic", 
  "Statistics & Probability" )

Now, given a string such as “Biology”, you can quickly look up your category:

myhash[ "Biology" ]
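The lookup is vectorised, so several terms can be translated in one call. A small sketch with a shortened table (demo.hash and terms are made-up names, and "Zoology" is deliberately missing from the table):

```r
# A shortened version of the lookup table, just for illustration.
demo.hash <- c("Biology"   = "Science",
               "Art"       = "Arts & Humanities",
               "Economics" = "Social Sciences")

terms <- c("Economics", "Biology", "Zoology")

# Index by name; unname() drops the names so only the categories remain.
# A term that is not in the table gives NA instead of an error.
res <- unname(demo.hash[terms])
print(res)
```

An NA in the result is a handy signal that a term is missing from the table or is spelled differently.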

I am not sure why you want to use a list instead of a vector of strings, so I will simplify your case a bit:

A <- c("Biology","Cell Biology","Art",
  "Humanities, Multidisciplinary; Psychology, Experimental",
  "Astronomy & Astrophysics; Physics, Particles & Fields",
  "Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods",
  "Geriatrics & Gerontology","Gerontology","Management","Operations Research & Management Science",
  "Computer Science, Artificial Intelligence; Computer Science, Information Systems; Engineering, Electrical & Electronic",
  "Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods; Statistics & Probability")

The hash lookup will not work for the composite strings (those containing “;”). You can, however, split them using strsplit. Then you can use unique to avoid repeated terms, and put the pieces back together using the paste function.

stringedit <- function( x ) { 
  # first, split into subterms
  a.all <- unlist( strsplit( x, "; *" ) )
  # look up each subterm, drop repeated categories, and glue them back together
  paste( unique( myhash[ a.all ] ), collapse = "; " ) 
}

unlist( lapply( A, stringedit  ) )

Here is the result, as desired:

[1] "Science"                            "Science"                            "Arts & Humanities"                  "Arts & Humanities; Social Sciences"
[5] "Science"                            "Social Sciences; Science"           "Science"                            "Social Sciences"                   
[9] "Social Sciences"                    "Science"                            "Science"                            "Social Sciences; Science"

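Of course, you can also do this in several steps with the *apply functions, like this:

# strsplit is vectorised and returns a list with one character vector per string
a.spl <- strsplit( A, "; *" )
a.spl <- lapply( a.spl, function( x ) unique( myhash[ x ] ) )
sapply( a.spl, paste, collapse = "; " )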

This is neither more nor less efficient than the previous code.

Yes, you could achieve the same with regular expressions, but you would still have to split the strings first and then use anchored patterns such as ^Biology$ to make sure they match “Biology” but not “Cell Biology” (unless you want to resort to constructs like “.* Biology”). Finally, you would have to get rid of the duplicates anyway, and all in all it would be, in my opinion, (i) more verbose (= more error prone) and (ii) not worth the effort.
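For comparison, a rough sketch of what the regular-expression route might look like for a single composite string. It only handles two terms, and the names terms, out and result are made up for the example:

```r
# Split first, so that each term can be matched as a whole.
terms <- unlist(strsplit("Cell Biology; Biology", "; *"))

# ^ and $ anchor the pattern to the whole term, so "^Biology$" leaves
# "Cell Biology" untouched.
out <- sub("^Biology$", "Science", terms)
out <- sub("^Cell Biology$", "Science", out)

# Duplicates still have to be removed by hand.
result <- paste(unique(out), collapse = "; ")
print(result)
```

You would need one such sub call per term, which is where the extra verbosity comes from.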

And how about using switch?

science.category <- function(science){
    switch(science,
           "Biology" =,
           "Cell Biology" =,
           "Astronomy & Astrophysics" =,
           "Physics, Particles & Fields" =,
           "Mathematics" =,
           "Mathematics, Applied" =,
           "Mathematics, Interdisciplinary Applications" =,
           "Geriatrics & Gerontology" =,
           "Operations Research & Management Science" =,
           "Computer Science, Artificial Intelligence" =,
           "Computer Science, Information Systems" =,
           "Engineering, Electrical & Electronic" =,
           "Statistics & Probability" = "Science",
           "Art" =,
           "Humanities, Multidisciplinary" = "Arts & Humanities",
           "Psychology, Experimental" =,
           "Economics" =,
           "Social Sciences, Mathematical Methods" =,
           "Gerontology" =,
           "Management" = "Social Sciences",
           NA
           )
}

a <- unlist(lapply(A, strsplit, split = " *; *"), recursive = FALSE)
a1 <- lapply(a, function(x) unique(sapply(x, science.category)))
sapply(a1, paste, collapse = "; ")

Of course, this will only work as long as the strings you pass in match the switch arguments exactly. One mismatch, and you’ll end up with NA. For more advanced usage, you should write your own wrapper around the grep family of functions, or even agrep (handle with care).
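As one possible starting point for such a wrapper, here is a sketch that tolerates differences in case and surrounding whitespace. The names known and lenient.category are hypothetical, the table is shortened, and the terms are assumed to contain no regex metacharacters such as ( or +:

```r
# Shortened lookup table, for illustration only.
known <- c("Biology"    = "Science",
           "Art"        = "Arts & Humanities",
           "Management" = "Social Sciences")

lenient.category <- function(term) {
  # Anchor the trimmed term so that "Art" cannot match inside a longer name.
  # Assumption: term contains no regex metacharacters.
  pattern <- paste0("^", trimws(term), "$")
  hit <- grep(pattern, names(known), ignore.case = TRUE)
  # Accept only an unambiguous hit; everything else becomes NA.
  if (length(hit) == 1) unname(known[hit]) else NA_character_
}

print(lenient.category(" biology "))
print(lenient.category("Cell Biology"))
```

Terms that are not in the table, like “Cell Biology” here, still come back as NA, so you can spot what is missing.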

I found it easiest to have a two-column data.frame as a lookup, with one column for the course name and one column for the category. Here’s an example:

course.categories <- data.frame(
  Course = 
  c("Art", "Humanities, Multidisciplinary", "Biology", "Cell Biology", 
    "Astronomy & Astrophysics", "Physics, Particles & Fields", "Mathematics", 
    "Mathematics, Applied", "Mathematics, Interdisciplinary Applications", 
    "Geriatrics & Gerontology", "Operations Research & Management Science", 
    "Computer Science, Artificial Intelligence", 
    "Computer Science, Information Systems", 
    "Engineering, Electrical & Electronic", "Statistics & Probability", 
    "Psychology, Experimental", "Economics", 
    "Social Sciences, Mathematical Methods", 
    "Gerontology", "Management"),
  Category =
  c("Arts & Humanities", "Arts & Humanities", "Science", "Science", 
    "Science", "Science", "Science", "Science", "Science", "Science", 
    "Science", "Science", "Science", "Science", "Science", "Social Sciences", 
    "Social Sciences", "Social Sciences", "Social Sciences", "Social Sciences"))

Then, assuming A is a list as in your question:

sapply(strsplit(unlist(A), "; "), 
       function(x) 
         paste(unique(course.categories[match(x, course.categories[["Course"]]),
                                        "Category"]), 
               collapse = "; "))
#  [1] "Science"                            "Science"                           
#  [3] "Arts & Humanities"                  "Arts & Humanities; Social Sciences"
#  [5] "Science"                            "Social Sciences; Science"          
#  [7] "Science"                            "Social Sciences"                   
#  [9] "Social Sciences"                    "Science"                           
# [11] "Science"                            "Social Sciences; Science"

match matches the values from A with the course names in the course.categories dataset and says which rows the match occurs on; this is used to extract the category the course belongs to. Then, unique makes sure we just have one of each category. paste puts things back together.
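To see the individual steps, here is the same idea on a tiny lookup table (the names lookup, x, rows, cats and combined are only for this illustration):

```r
# A three-row version of the lookup table.
lookup <- data.frame(
  Course   = c("Art", "Biology", "Economics"),
  Category = c("Arts & Humanities", "Science", "Social Sciences"),
  stringsAsFactors = FALSE)

# The terms of one already-split string.
x <- c("Economics", "Biology")

# match gives the row number of each term in the Course column.
rows <- match(x, lookup[["Course"]])

# Use the row numbers to pull out the categories, then collapse them.
cats <- lookup[rows, "Category"]
combined <- paste(unique(cats), collapse = "; ")
print(combined)
```

A term that is not in the Course column gives NA from match, so missing entries are easy to find.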
