Html - extracting text from webscraping

JF96
April 20, 2024
225 views
3 votes
3 Answers

I’m trying to extract text from a website.
My code works (sort of):

for (i in 1:no_urls) {
  this_url=urls_meetings[[i]]
  page=read_html(this_url)
  
  text=page |> html_elements("body") |> html_text2()
  text_date=text[1]
  date<- str_extract(text_date, "\\b\\w+ \\d{1,2}, \\d{4}\\b")
  # Convert the abbreviated month name to its full form
  date_str <- gsub("^(.*)\\s(\\d{1,2}),\\s(\\d{4})$", "\\1 \\2, \\3", date)

  # Convert to Date object
  date <- mdy(date_str)
  date_1=as.character(date)
  date_1=gsub("-", "", date_1)


  text=text[2]
  statements_list[[i]]=text
  names(statements_list)[i] <- date_1

}

The problem is the output of the line

text=page |> html_elements("body") |> html_text2()

which gives me the entire text of the page

[1] "\r \r\r \r\nRelease Date: January 29, 2003\r\n\n\n\n\n\r For immediate release\r\n\n\r\n\n\r\r\n\n\r The Federal Open Market Committee decided today to keep its target for the federal funds rate unchanged at 1-1/4 percent. \r\n\n\r Oil price premiums and other aspects of geopolitical risks have reportedly fostered continued restraint on spending and hiring by businesses. However, the Committee believes that as those risks lift, as most analysts expect, the accommodative stance of monetary policy, coupled with ongoing growth in productivity, will provide support to an improving economic climate over time.\r\n\n\r In these circumstances, the Committee believes that, against the background of its long-run goals of price stability and sustainable economic growth and of the information currently available, the risks are balanced with respect to the prospects for both goals for the foreseeable future. \r\n\n\r Voting for the FOMC monetary policy action were Alan Greenspan, Chairman; William J. McDonough, Vice Chairman; Ben S. Bernanke, Susan S. Bies; J. Alfred Broaddus, Jr.; Roger W. Ferguson, Jr.; Edward M. Gramlich; Jack Guynn; Donald L. Kohn; Michael H. Moskow; Mark W. Olson, and Robert T. Parry. 
\r \r \r\n\n\r -----------------------------------------------------------------------------------------\r DO NOT REMOVE: Wireless Generation\r ------------------------------------------------------------------------------------------\r 2003 Monetary policy \r\n\nHome | News and \r events\nAccessibility\r\n\r Last update: January 29, 2003\r\r \r\n(function(){if (!document.body) return;var js = \"window['__CF$cv$params']={r:'8775c6b49a2a2015',t:'MTcxMzYyMjgzOC41MjIwMDA='};_cpo=document.createElement('script');_cpo.nonce='',_cpo.src='/cdn-cgi/challenge-platform/scripts/jsd/main.js',document.getElementsByTagName('head')[0].appendChild(_cpo);\";var _0xh = document.createElement('iframe');_0xh.height = 1;_0xh.width = 1;_0xh.style.position = 'absolute';_0xh.style.top = 0;_0xh.style.left = 0;_0xh.style.border = 'none';_0xh.style.visibility = 'hidden';document.body.appendChild(_0xh);function handler() {var _0xi = _0xh.contentDocument || _0xh.contentWindow.document;if (_0xi) {var _0xj = _0xi.createElement('script');_0xj.innerHTML = js;_0xi.getElementsByTagName('head')[0].appendChild(_0xj);}}if (document.readyState !== 'loading') {handler();} else if (window.addEventListener) {document.addEventListener('DOMContentLoaded', handler);} else {var prev = document.onreadystatechange || function () {};document.onreadystatechange = function (e) {prev(e);if (document.readyState !== 'loading') {document.onreadystatechange = prev;handler();}};}})();"

I need to keep only the relevant text. I’ve tried all sorts of things

str_extract(text, "(?<=The Federal Open Market)(.*?)(?=Voting)")


 str_match(text, "The Federal Open Market(.*?)Voting")

but they all give me a null character in return

The ideal output is

The Federal Open Market Committee decided today to keep its target for the federal funds rate unchanged at 1-1/4 percent. \r\n\n\r Oil price premiums and other aspects of geopolitical risks have reportedly fostered continued restraint on spending and hiring by businesses. However, the Committee believes that as those risks lift, as most analysts expect, the accommodative stance of monetary policy, coupled with ongoing growth in productivity, will provide support to an improving economic climate over time.\r\n\n\r In these circumstances, the Committee believes that, against the background of its long-run goals of price stability and sustainable economic growth and of the information currently available, the risks are balanced with respect to the prospects for both goals for the foreseeable future.

Tags: html r rvest web-scraping

Answers

- SamR
- April 20, 2024 at 5:06 pm
- 0 votes
The . character does not match new lines by default

Your pattern isn’t working because your string contains line breaks. By definition, the . metacharacter matches any character except a newline.

Here’s a shorter example:
```
txt <- "there are some\r\nwords here"
str_extract(txt, "some.+words")
# [1] NA
```
Override the defaults

To override the defaults in stringr::str_extract(), wrap the pattern in stringr::regex() and set the relevant option. In this case, you can allow . to match everything, including \n, by setting dotall = TRUE:
```
str_extract(txt, regex("some.+words", dotall = TRUE))
# [1] "some\r\nwords"
```
Or in the case of your string:
```
str_extract(text, regex("(?<=The Federal Open Market)(.*?)(?=Voting)", dotall = TRUE))  |> 
    trimws()
# [1] "Committee decided today to keep its target for the federal funds rate unchanged at 1-1/4 percent. \r\n\n\r Oil price premiums and other aspects of geopolitical risks have reportedly fostered continued restraint on spending and hiring by businesses. However, the Committee believes that as those risks lift, as most analysts expect, the accommodative stance of monetary policy, coupled with ongoing growth in productivity, will provide support to an improving economic climate over time.\r\n\n\r In these circumstances, the Committee believes that, against the background of its long-run goals of price stability and sustainable economic growth and of the information currently available, the risks are balanced with respect to the prospects for both goals for the foreseeable future."
```
I have also passed this to trimws() to remove the leading and trailing whitespace.
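As a side note, the same option can be switched on inside the pattern itself with the inline `(?s)` flag, which the ICU regex engine behind stringr supports. The snippet below is a minimal sketch on a made-up sample string (not the real page text), checking that both spellings return the same match:

```r
library(stringr)

# Hypothetical sample string with Windows-style line breaks (not the real page)
txt <- "Release\r\nThe Federal Open Market Committee held rates.\r\n\r\nVoting for the action were the members."

# (?s) turns on dotall mode from inside the pattern
inline <- str_extract(txt, "(?s)(?<=The Federal Open Market).*?(?=Voting)")

# Equivalent call using regex(dotall = TRUE)
wrapped <- str_extract(
  txt,
  regex("(?<=The Federal Open Market).*?(?=Voting)", dotall = TRUE)
)

identical(inline, wrapped)
# str_squish() also collapses the embedded line breaks into single spaces
str_squish(inline)
```

If you also want the line breaks inside the extracted text gone, str_squish() may be a better fit than trimws(), since trimws() only strips the ends.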

Other multi line options

If you want to generally extend your regular expressions to work across multiple lines (and not just change the . metacharacter), you can use regex(pattern, multiline = TRUE). This changes the behaviour of ^ and $, and introduces three new operators:
- ^ now matches the start of each line.
- $ now matches the end of each line.
- \A matches the start of the input.
- \z matches the end of the input.
- \Z matches the end of the input, but before the final line terminator, if it exists.
See the stringr docs for more.
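To make the difference concrete, here is a small sketch on an invented three-line string. It compares the default behaviour of ^ with multiline = TRUE, and shows that \A still anchors only at the very start of the input:

```r
library(stringr)

# Invented three-line example string
txt <- "first line\nsecond line\nthird line"

# Default: ^ anchors only at the start of the whole string
default_starts <- str_extract_all(txt, "^\\w+")[[1]]

# multiline = TRUE: ^ anchors at the start of every line
line_starts <- str_extract_all(txt, regex("^\\w+", multiline = TRUE))[[1]]

# \A keeps anchoring at the start of the input, even in multiline mode
input_start <- str_extract_all(txt, regex("\\A\\w+", multiline = TRUE))[[1]]

default_starts  # "first"
line_starts     # "first" "second" "third"
input_start     # "first"
```

Note that multiline = TRUE does not change what . matches, so it would not fix the original pattern on its own; dotall = TRUE is the option that does.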

- Dave2e
- April 20, 2024 at 5:22 pm
- 0 votes
Looking at the html, it looks like you can extract just the table:
```
this_url <- "https://www.federalreserve.gov/boarddocs/press/general/2002/20020130/"
page=read_html(this_url)

text=page |> html_elements("table") |> html_text() |> trimws()
```
And just get this:

[1] "The Federal Open Market Committee decided today to keep its
target for the federal funds rate unchanged at 1-3/4
percent.\r\n\r\nSigns that weakness in demand is abating and economic
activity is beginning to firm have become more prevalent. With the
forces restraining the economy starting to diminish, and with the
long-term prospects for productivity growth remaining favorable and
monetary policy accommodative, the outlook for economic recovery has
become more promising.\r\n\r\nThe degree of any strength in business
capital and household spending, however, is still uncertain. Hence,
the Committee continues to believe that, against the background of its
long-run goals of price stability and sustainable economic growth and
of the information currently available, the risks are weighted mainly
toward conditions that may generate economic weakness in the
foreseeable future.\r\n\r\n\r\n\r\n2002 Monetary policy \r\n\r\nHome
|\r\nNews and events\r\nAccessibility\r\n\r\nLast update: January 30,
2002"

Or separate the paragraphs with this:
```
page |> html_elements("table p") |> html_text() |> trimws()
```
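If you want to try the selector without hitting the site, a rough sketch like the one below works offline. The HTML fragment is invented to mimic the table layout (it is not the real page source), and dropping the last paragraph by position is an assumption about where the navigation text sits:

```r
library(rvest)

# Invented fragment standing in for the real press release page
page <- minimal_html('
  <table><tr><td>
    <p>The Committee decided to keep its target unchanged.</p>
    <p>The risks are balanced.</p>
    <p>2003 Monetary policy</p>
  </td></tr></table>
  <script>var tracking = "ignored";</script>
')

# One character element per <p> inside the table; the <script> is never selected
paragraphs <- page |> html_elements("table p") |> html_text() |> trimws()

# Assumption: the last paragraph is navigation text, so drop it by position
statement <- head(paragraphs, -1)
paste(statement, collapse = " ")
```

Selecting elements this way sidesteps the regex problem entirely, because the script text at the bottom of the body is never part of the result.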

Assuming the structure is more or less stable, you can specify which positional paragraphs to include / exclude.

library(rvest)
library(stringr)

# list of urls
urls_ <- c("https://www.federalreserve.gov/boarddocs/press/general/2002/20020130/")

collect_text <- function(url_){
  html <- read_html(url_)
  
  release_date <- 
    html_element(html, "body > font > i") |> 
    html_text() |> 
    str_split_i(": ", 2) |> 
    lubridate::mdy()
  
  text <- 
    # all p elements in td, except the last one
    html_elements(html, "td > p:not(p:last-of-type)") |> 
    html_text(trim = TRUE) |> 
    str_c(collapse = "") |>
    str_squish()

  # return named list
  list(release_date = release_date, text = text)
}

df <- 
  lapply(urls_, collect_text) |>
  dplyr::bind_rows()
df  
#> # A tibble: 1 × 2
#>   release_date text                                                             
#>   <date>       <chr>                                                            
#> 1 2002-01-30   The Federal Open Market Committee decided today to keep its targ…

str_view(df[1,"text"])
#> [1] │ The Federal Open Market Committee decided today to keep its target for the federal funds rate unchanged at 1-3/4 percent.Signs that weakness in demand is abating and economic activity is beginning to firm have become more prevalent. With the forces restraining the economy starting to diminish, and with the long-term prospects for productivity growth remaining favorable and monetary policy accommodative, the outlook for economic recovery has become more promising.The degree of any strength in business capital and household spending, however, is still uncertain. Hence, the Committee continues to believe that, against the background of its long-run goals of price stability and sustainable economic growth and of the information currently available, the risks are weighted mainly toward conditions that may generate economic weakness in the foreseeable future.
