I’m trying to scrape this table of regions that support Microsoft’s Speech service. I’ve managed to get the following character vector:
region <- c("southafricanorth 6", "eastasia 5", "southeastasia 1,2,3,4,5",
"australiaeast 1,2,3,4", "centralindia 1,2,3,4,5", "japaneast 2,5",
"japanwest", "koreacentral 2", "canadacentral 1", "northeurope 1,2,4,5",
"westeurope 1,2,3,4,5", "francecentral", "germanywestcentral",
"norwayeast", "switzerlandnorth 6", "switzerlandwest", "uksouth 1,2,3,4",
"uaenorth 6", "brazilsouth 6", "centralus", "eastus 1,2,3,4,5",
"eastus2 1,2,4,5", "northcentralus 4,6", "southcentralus 1,2,3,4,5,6",
"westcentralus 5", "westus 2,5", "westus2 1,2,4,5", "westus3"
)
What is the regex that gets rid of all the numbers and commas that are at least 2 spaces to the right of the words? For example, I just want westus2, instead of westus2 1,2,4,5.
I’ve tried this to no avail: gsub("\s{2,}\d+.*", "", region)
3 Answers
The region names without the superscripts are contained inside <code> tags in the HTML. So you could avoid the need for regexes by modifying your scraping code to extract the text of those tags directly.

Another elegant solution is the word() function from the stringr package. The first word is the default:
word(string, start = 1L, end = start, sep = fixed(" "))
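As a rough illustration of the word() approach, here is a minimal sketch applied to a few entries from the question's vector. It assumes stringr is installed; the shortened vector and the name region_names are only for illustration.

```r
library(stringr)

# A small subset of the vector from the question (shortened for illustration)
region <- c("southafricanorth 6", "japanwest", "eastus2 1,2,4,5", "westus3")

# word() splits on a single space by default and returns the first word,
# so the trailing footnote numbers are dropped
region_names <- word(region, 1)
print(region_names)
```

Entries without any footnote numbers, such as japanwest, should pass through unchanged because they already consist of one word.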
Your regex does not match because your string does not contain two consecutive spaces: each region name is separated from its numbers by a single space. If you change \s{2,} to \s, it should give the expected result. In this case it looks like the pattern could be simplified further.
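To make the fix concrete, here is a small base R sketch on a subset of the question's vector. The first call is the original pattern with \s{2,} relaxed to \s. The second is one possible simplification, chosen here as an assumption rather than taken from the answer. Note that backslashes have to be doubled inside an R string literal.

```r
# A small subset of the vector from the question (shortened for illustration)
region <- c("southafricanorth 6", "japanwest", "eastus2 1,2,4,5", "westus3")

# Original pattern with \s{2,} relaxed to a single whitespace character.
# "\\s" in the R string is the regex \s.
fixed <- gsub("\\s\\d+.*", "", region)

# One possible simplification (an assumption for illustration):
# remove everything from the first whitespace character onward.
simpler <- sub("\\s.*", "", region)

print(fixed)
print(simpler)
```

Both calls leave entries such as japanwest untouched, since those contain no whitespace for the pattern to match.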