Remove numbers & commas 2+ spaces right of word in HTML?

HowardBaek
May 17, 2023
143 views
3 votes
3 Answers

I’m trying to scrape this table of regions that support Microsoft’s Speech service. I’ve managed to get the following character vector:

region <- c("southafricanorth 6", "eastasia 5", "southeastasia 1,2,3,4,5", 
"australiaeast 1,2,3,4", "centralindia 1,2,3,4,5", "japaneast 2,5", 
"japanwest", "koreacentral 2", "canadacentral 1", "northeurope 1,2,4,5", 
"westeurope 1,2,3,4,5", "francecentral", "germanywestcentral", 
"norwayeast", "switzerlandnorth 6", "switzerlandwest", "uksouth 1,2,3,4", 
"uaenorth 6", "brazilsouth 6", "centralus", "eastus 1,2,3,4,5", 
"eastus2 1,2,4,5", "northcentralus 4,6", "southcentralus 1,2,3,4,5,6", 
"westcentralus 5", "westus 2,5", "westus2 1,2,4,5", "westus3"
)

What regex gets rid of all the numbers and commas that sit at least 2 spaces to the right of the words? For example, I just want westus2 instead of westus2 1,2,4,5.

I’ve tried this to no avail: gsub("\\s{2,}\\d+.*", "", region)

Tags: html r replace

Answers

The region names without the superscripts are contained inside <code> tags in the HTML, so you could avoid the need for regexes by modifying your scraping code to something like:

library(rvest)

url <- "https://learn.microsoft.com/en-us/azure/cognitive-services/Speech-Service/regions"

regions <- read_html(url) %>% 
  # first table only
  html_element("table") %>% 
  html_elements("code") %>% 
  html_text()

regions

 [1] "southafricanorth"   "eastasia"           "southeastasia"      "australiaeast"     
 [5] "centralindia"       "japaneast"          "japanwest"          "koreacentral"      
 [9] "canadacentral"      "northeurope"        "westeurope"         "francecentral"     
[13] "germanywestcentral" "norwayeast"         "switzerlandnorth"   "switzerlandwest"   
[17] "uksouth"            "uaenorth"           "brazilsouth"        "centralus"         
[21] "eastus"             "eastus2"            "northcentralus"     "southcentralus"    
[25] "westcentralus"      "westus"             "westus2"            "westus3"

Another elegant solution is the word() function from the stringr package.

It returns the first word by default, as its signature shows:

word(string, start = 1L, end = start, sep = fixed(" "))

library(stringr)

word(region)

 [1] "southafricanorth"   "eastasia"           "southeastasia"      "australiaeast"     
 [5] "centralindia"       "japaneast"          "japanwest"          "koreacentral"      
 [9] "canadacentral"      "northeurope"        "westeurope"         "francecentral"     
[13] "germanywestcentral" "norwayeast"         "switzerlandnorth"   "switzerlandwest"   
[17] "uksouth"            "uaenorth"           "brazilsouth"        "centralus"         
[21] "eastus"             "eastus2"            "northcentralus"     "southcentralus"    
[25] "westcentralus"      "westus"             "westus2"            "westus3"
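
If you would rather not add a dependency, the same "keep the first word" idea can be sketched in base R. This is only an illustration on three entries copied from the question, not how word() is implemented; the names sample_regions, pieces and first_words are made up for the example.

```r
# Sample entries copied from the question's vector
sample_regions <- c("eastus2 1,2,4,5", "westus3", "japaneast 2,5")

# Split each string on a literal space, then keep the first piece
pieces <- strsplit(sample_regions, " ", fixed = TRUE)
first_words <- vapply(pieces, function(p) p[1], character(1))

print(first_words)
```

Entries without a space, such as "westus3", come back unchanged because the split yields a single piece.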

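Your regex does not match because your strings do not contain two consecutive spaces: a single space separates each name from its numbers. If you change \s{2,} to \s (written "\\s" inside an R string), it should give the expected result.

sub("\\s\\d+.*", "", region)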
# [1] "southafricanorth"   "eastasia"           "southeastasia"     
# [4] "australiaeast"      "centralindia"       "japaneast"         
# [7] "japanwest"          "koreacentral"       "canadacentral"     
#[10] "northeurope"        "westeurope"         "francecentral"     
#[13] "germanywestcentral" "norwayeast"         "switzerlandnorth"  
#[16] "switzerlandwest"    "uksouth"            "uaenorth"          
#[19] "brazilsouth"        "centralus"          "eastus"            
#[22] "eastus2"            "northcentralus"     "southcentralus"    
#[25] "westcentralus"      "westus"             "westus2"           
#[28] "westus3"
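
To see the difference between the two quantifiers, here is a minimal sketch on three entries copied from the question. It assumes the separator really is one ordinary space, as in the vector posted; sample_regions, two_spaces and one_space are names invented for the illustration.

```r
# Sample entries copied from the question's vector
sample_regions <- c("eastus2 1,2,4,5", "westus3", "japaneast 2,5")

# Needs two or more whitespace characters before the digits:
# nothing matches, so the input comes back unchanged
two_spaces <- sub("\\s{2,}\\d+.*", "", sample_regions)

# Needs exactly one whitespace character before the digits:
# the trailing numbers and commas are removed
one_space <- sub("\\s\\d+.*", "", sample_regions)

print(two_spaces)
print(one_space)
```

Note that the 2 in "eastus2" survives in both cases, since the pattern only starts matching at a whitespace character.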

In this case, it looks like it could be simplified to either of the following:

sub(" .*", "", region)

sub(" .+", "", region)
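
The two simplified patterns agree on the data in the question, but they are not identical in general. A small sketch of the difference follows; the "centralus " entry with a trailing space is hypothetical and does not appear in the original vector.

```r
# Two entries from the question plus one hypothetical entry
# that ends in a single trailing space
x <- c("westus2 1,2,4,5", "westus3", "centralus ")

# " .*" matches a space followed by zero or more characters,
# so a bare trailing space is removed too
star <- sub(" .*", "", x)

# " .+" needs at least one character after the space,
# so a bare trailing space is left in place
plus <- sub(" .+", "", x)

print(star)
print(plus)
```

If the scraped strings might carry trailing whitespace, " .*" is probably the safer of the two.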
