I’m trying to scrape this table of regions that support Microsoft’s Speech service. I’ve managed to get the following character vector:

region <- c("southafricanorth 6", "eastasia 5", "southeastasia 1,2,3,4,5", 
"australiaeast 1,2,3,4", "centralindia 1,2,3,4,5", "japaneast 2,5", 
"japanwest", "koreacentral 2", "canadacentral 1", "northeurope 1,2,4,5", 
"westeurope 1,2,3,4,5", "francecentral", "germanywestcentral", 
"norwayeast", "switzerlandnorth 6", "switzerlandwest", "uksouth 1,2,3,4", 
"uaenorth 6", "brazilsouth 6", "centralus", "eastus 1,2,3,4,5", 
"eastus2 1,2,4,5", "northcentralus 4,6", "southcentralus 1,2,3,4,5,6", 
"westcentralus 5", "westus 2,5", "westus2 1,2,4,5", "westus3"
)

What is the regex that gets rid of all the numbers and commas that are at least 2 spaces to the right of the words? For example, I just want westus2 instead of westus2 1,2,4,5.

I’ve tried this to no avail: gsub("\\s{2,}\\d+.*", "", region)

3 Answers


  1. The region names without the superscripts are contained inside <code> tags in the HTML, so you could avoid the need for a regex by modifying your scraping code to something like:

    library(rvest)
    
    url <- "https://learn.microsoft.com/en-us/azure/cognitive-services/Speech-Service/regions"
    
    regions <- read_html(url) %>% 
      # first table only
      html_element("table") %>% 
      html_elements("code") %>% 
      html_text()
    
    regions
    
     [1] "southafricanorth"   "eastasia"           "southeastasia"      "australiaeast"     
     [5] "centralindia"       "japaneast"          "japanwest"          "koreacentral"      
     [9] "canadacentral"      "northeurope"        "westeurope"         "francecentral"     
    [13] "germanywestcentral" "norwayeast"         "switzerlandnorth"   "switzerlandwest"   
    [17] "uksouth"            "uaenorth"           "brazilsouth"        "centralus"         
    [21] "eastus"             "eastus2"            "northcentralus"     "southcentralus"    
    [25] "westcentralus"      "westus"             "westus2"            "westus3"
    
  2. Another elegant solution is the word() function from the stringr package.

    By default it returns the first word:

    word(string, start = 1L, end = start, sep = fixed(" "))

    library(stringr)
    
    word(region)
    
     [1] "southafricanorth"   "eastasia"           "southeastasia"      "australiaeast"     
     [5] "centralindia"       "japaneast"          "japanwest"          "koreacentral"      
     [9] "canadacentral"      "northeurope"        "westeurope"         "francecentral"     
    [13] "germanywestcentral" "norwayeast"         "switzerlandnorth"   "switzerlandwest"   
    [17] "uksouth"            "uaenorth"           "brazilsouth"        "centralus"         
    [21] "eastus"             "eastus2"            "northcentralus"     "southcentralus"    
    [25] "westcentralus"      "westus"             "westus2"            "westus3"
    
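  3. Your regex does not match because your strings do not contain two consecutive spaces: there is only a single space between each name and the numbers. If you change \s{2,} to \s, it should give the expected result. Note that the backslashes must be doubled inside an R string literal.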

    sub("\\s\\d+.*", "", region)
    # [1] "southafricanorth"   "eastasia"           "southeastasia"     
    # [4] "australiaeast"      "centralindia"       "japaneast"         
    # [7] "japanwest"          "koreacentral"       "canadacentral"     
    #[10] "northeurope"        "westeurope"         "francecentral"     
    #[13] "germanywestcentral" "norwayeast"         "switzerlandnorth"  
    #[16] "switzerlandwest"    "uksouth"            "uaenorth"          
    #[19] "brazilsouth"        "centralus"          "eastus"            
    #[22] "eastus2"            "northcentralus"     "southcentralus"    
    #[25] "westcentralus"      "westus"             "westus2"           
    #[28] "westus3"           
    

    In this case it looks like it could be simplified to

    sub(" .*", "", region)
    

    or

    sub(" .+", "", region)
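
As a quick check of the points above, here is a minimal base R sketch. It uses only a four-element subset of the question's vector, and the `"abc "` input at the end is a made-up string (not from the table) used to show where the two simplified patterns differ.

```r
# Subset of the question's vector: one space separates name and numbers
region <- c("southafricanorth 6", "japanwest", "eastus2 1,2,4,5", "westus3")

# No element has two or more consecutive whitespace characters,
# so a pattern starting with \\s{2,} can never match here.
has_double_space <- any(grepl("\\s{2,}", region))

# Backslashes are doubled inside R string literals.
by_digits <- sub("\\s\\d+.*", "", region)

# Simplified form: drop everything from the first space onward.
by_space <- sub(" .*", "", region)

# " .+" needs at least one character after the space, so it leaves a
# bare trailing space in place, whereas " .*" removes it.
# "abc " is a hypothetical input for illustration only.
keeps_trailing <- sub(" .+", "", "abc ")
drops_trailing <- sub(" .*", "", "abc ")

print(has_double_space)
print(by_digits)
```

On this subset both `by_digits` and `by_space` come out as `"southafricanorth" "japanwest" "eastus2" "westus3"`. The digit-anchored pattern is the safer choice if a name could ever contain a space that is not followed by a number.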
    