I’m trying to extract text from a website.
My code works (sort of):
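library(rvest)      # read_html(), html_elements(), html_text2()
library(stringr)    # str_extract()
library(lubridate)  # mdy()

for (i in 1:no_urls) {
  this_url = urls_meetings[[i]]
  page = read_html(this_url)
  text = page |> html_elements("body") |> html_text2()
  text_date = text[1]
  # Pull out a date such as "January 29, 2003" (backslashes must be doubled in R strings)
  date <- str_extract(text_date, "\\b\\w+ \\d{1,2}, \\d{4}\\b")
  # Rebuild the date as "Month D, YYYY" (a well-formed date is left unchanged)
  date_str <- gsub("^(.*)\\s(\\d{1,2}),\\s(\\d{4})$", "\\1 \\2, \\3", date)
  # Convert to Date object
  date <- mdy(date_str)
  # Format as YYYYMMDD to use as the list name
  date_1 = as.character(date)
  date_1 = gsub("-", "", date_1)
  text = text[2]
  # Store the text and name the same list element by its date
  statements_list[[i]] = text
  names(statements_list)[i] <- date_1
}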
The problem is the output of the line
text=page |> html_elements("body") |> html_text2()
which gives me the entire text of the page:
[1] "\r \r\r \r\nRelease Date: January 29, 2003\r\n\n\n\n\n\r For immediate release\r\n\n\r\n\n\r\r\n\n\r The Federal Open Market Committee decided today to keep its target for the federal funds rate unchanged at 1-1/4 percent. \r\n\n\r Oil price premiums and other aspects of geopolitical risks have reportedly fostered continued restraint on spending and hiring by businesses. However, the Committee believes that as those risks lift, as most analysts expect, the accommodative stance of monetary policy, coupled with ongoing growth in productivity, will provide support to an improving economic climate over time.\r\n\n\r In these circumstances, the Committee believes that, against the background of its long-run goals of price stability and sustainable economic growth and of the information currently available, the risks are balanced with respect to the prospects for both goals for the foreseeable future. \r\n\n\r Voting for the FOMC monetary policy action were Alan Greenspan, Chairman; William J. McDonough, Vice Chairman; Ben S. Bernanke, Susan S. Bies; J. Alfred Broaddus, Jr.; Roger W. Ferguson, Jr.; Edward M. Gramlich; Jack Guynn; Donald L. Kohn; Michael H. Moskow; Mark W. Olson, and Robert T. Parry. 
\r \r \r\n\n\r -----------------------------------------------------------------------------------------\r DO NOT REMOVE: Wireless Generation\r ------------------------------------------------------------------------------------------\r 2003 Monetary policy \r\n\nHome | News and \r events\nAccessibility\r\n\r Last update: January 29, 2003\r\r \r\n(function(){if (!document.body) return;var js = \"window['__CF$cv$params']={r:'8775c6b49a2a2015',t:'MTcxMzYyMjgzOC41MjIwMDA='};_cpo=document.createElement('script');_cpo.nonce='',_cpo.src='/cdn-cgi/challenge-platform/scripts/jsd/main.js',document.getElementsByTagName('head')[0].appendChild(_cpo);\";var _0xh = document.createElement('iframe');_0xh.height = 1;_0xh.width = 1;_0xh.style.position = 'absolute';_0xh.style.top = 0;_0xh.style.left = 0;_0xh.style.border = 'none';_0xh.style.visibility = 'hidden';document.body.appendChild(_0xh);function handler() {var _0xi = _0xh.contentDocument || _0xh.contentWindow.document;if (_0xi) {var _0xj = _0xi.createElement('script');_0xj.innerHTML = js;_0xi.getElementsByTagName('head')[0].appendChild(_0xj);}}if (document.readyState !== 'loading') {handler();} else if (window.addEventListener) {document.addEventListener('DOMContentLoaded', handler);} else {var prev = document.onreadystatechange || function () {};document.onreadystatechange = function (e) {prev(e);if (document.readyState !== 'loading') {document.onreadystatechange = prev;handler();}};}})();"
I need to keep only the relevant text. I’ve tried all sorts of things, such as:
str_extract(text, "(?<=The Federal Open Market)(.*?)(?=Voting)")
str_match(text, "The Federal Open Market(.*?)Voting")
but they all return NA instead of the text.
The ideal output is
The Federal Open Market Committee decided today to keep its target for the federal funds rate unchanged at 1-1/4 percent. \r\n\n\r Oil price premiums and other aspects of geopolitical risks have reportedly fostered continued restraint on spending and hiring by businesses. However, the Committee believes that as those risks lift, as most analysts expect, the accommodative stance of monetary policy, coupled with ongoing growth in productivity, will provide support to an improving economic climate over time.\r\n\n\r In these circumstances, the Committee believes that, against the background of its long-run goals of price stability and sustainable economic growth and of the information currently available, the risks are balanced with respect to the prospects for both goals for the foreseeable future.
3 Answers
The . character does not match new lines by default
The reason your pattern isn’t working is that your string contains new lines. By definition, the . metacharacter matches any character except a newline.
Here’s a shorter example:
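A minimal sketch of this behaviour, assuming stringr is installed. The strings and the variable name x are made up for illustration and are not taken from the page:

```r
library(stringr)

# Made-up two-line string, for illustration only
x <- "start\nmiddle\nend"

# `.` stops at each line break, so nothing can span from "start" to "end"
str_extract(x, "start(.*?)end")
#> [1] NA

# Without line breaks in between, the same pattern matches
str_extract("start middle end", "start(.*?)end")
#> [1] "start middle end"
```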
Override the defaults
To override the defaults in stringr::str_extract(), you need to use stringr::regex() with the relevant option. In this case, that option is dotall = TRUE, which lets . match line breaks as well.
Or in the case of your string:
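A sketch of both cases, assuming stringr is installed. The first string is made up, and text is a shortened stand-in for the scraped page, so its exact whitespace is an assumption. The pattern is one possible choice that keeps the opening words and stops before "Voting":

```r
library(stringr)

# Made-up example: with dotall = TRUE, `.` also matches line breaks
x <- "start\nmiddle\nend"
str_extract(x, regex("start(.*?)end", dotall = TRUE))
#> [1] "start\nmiddle\nend"

# Shortened stand-in for the scraped page text (whitespace is assumed)
text <- "\r\nFor immediate release\r\n\n\r The Federal Open Market Committee decided today to keep its target for the federal funds rate unchanged at 1-1/4 percent. \r\n\n\r Voting for the FOMC monetary policy action were Alan Greenspan, Chairman"

# The lookahead keeps "Voting" out of the match; trimws() then strips
# the whitespace left at either end
statement <- str_extract(
  text,
  regex("The Federal Open Market.*?(?=Voting)", dotall = TRUE)
) |> trimws()
statement
#> [1] "The Federal Open Market Committee decided today to keep its target for the federal funds rate unchanged at 1-1/4 percent."
```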
I have also passed this to trimws() to remove the leading and trailing whitespace.
Other multi line options
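If you want to generally extend your regular expressions to work across multiple lines (and not just the . metacharacter), you can use regex(pattern, multiline = TRUE), which makes ^ and $ match at the start and end of each line rather than of the whole string.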
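A small sketch of the difference multiline = TRUE makes, assuming stringr is installed. The three-line string is made up for illustration:

```r
library(stringr)

# Made-up three-line string, for illustration only
x <- "first line\nsecond line\nthird line"

# By default, ^ anchors only at the start of the whole string
str_extract_all(x, "^\\w+")[[1]]
#> [1] "first"

# With multiline = TRUE, ^ also anchors at the start of every line
str_extract_all(x, regex("^\\w+", multiline = TRUE))[[1]]
#> [1] "first"  "second" "third"
```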
See the stringr docs for more.

Looking at the html, it looks like you can extract just the table:
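A sketch of that idea on a toy page, assuming rvest is installed. The HTML below is a made-up stand-in, and the "table" selector is an assumption about the real page's markup, so inspect the page to confirm which element wraps the statement:

```r
library(rvest)

# Toy stand-in for the real page; the actual markup may differ
page <- minimal_html('
  <p>Release Date: January 29, 2003</p>
  <table><tr><td>
    <p>The Federal Open Market Committee decided today to keep its target for the federal funds rate unchanged at 1-1/4 percent.</p>
    <p>Voting for the FOMC monetary policy action were Alan Greenspan, Chairman</p>
  </td></tr></table>
  <script>var js = "not part of the statement";</script>
')

# html_element() returns only the first <table>, so the release-date line
# and the script outside the table are left out
statement <- page |>
  html_element("table") |>
  html_text2()
statement
```

Selecting the element first means no regular expression is needed to cut away the navigation text and the script at the bottom of the page.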
And just get this:
Or separate the paragraphs with this:
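A sketch on the same kind of toy page, assuming rvest is installed. The markup and the "table p" selector are assumptions about the real page, and dropping the last paragraph is just one illustrative choice:

```r
library(rvest)

# Toy stand-in for the real page; the actual markup may differ
page <- minimal_html('
  <table><tr><td>
    <p>The Federal Open Market Committee decided today to keep its target for the federal funds rate unchanged at 1-1/4 percent.</p>
    <p>In these circumstances, the Committee believes that the risks are balanced.</p>
    <p>Voting for the FOMC monetary policy action were Alan Greenspan, Chairman</p>
  </td></tr></table>
')

# One character string per <p> inside the table
paragraphs <- page |>
  html_elements("table p") |>
  html_text2()

# Drop the last paragraph (the voting record) by position
statement <- head(paragraphs, -1)
length(statement)
#> [1] 2
```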
Assuming the structure is more or less stable, you can specify which positional paragraphs to include / exclude.