
I am learning how to use the Reddit API and am trying to extract all comments from a specific post.

For example, consider this post: https://www.reddit.com/r/Homebrewing/comments/11dd5r3/worst_mistake_youve_made_as_a_homebrewer/

Using this R code, I think I was able to access the comments:

library(httr)
library(jsonlite)

# Set authentication parameters
auth <- authenticate("some-key1", "some_key2")

# Set user agent
user_agent <- "my_app/0.1"

# Get access token
# httr takes config objects (authentication, user agent) as unnamed arguments
response <- POST("https://www.reddit.com/api/v1/access_token",
                 auth,
                 user_agent(user_agent),
                 body = list(grant_type = "password",
                             username = "abc123",
                             password = "123abc"),
                 encode = "form")

# Extract access token from response
access_token <- content(response)$access_token

# Use access token to make API request
url <- "https://oauth.reddit.com/LISTING" # Replace "LISTING" with the subreddit or endpoint you want to access

headers <- c("Authorization" = paste("Bearer", access_token))
result <- GET(url, user_agent(user_agent), add_headers(headers))

post_id <- "11dd5r3"
url <- paste0("https://oauth.reddit.com/r/Homebrewing/comments/", post_id)

# Set the user agent string 
user_agent_string <- "MyApp/1.0"

# Set the authorization header 
authorization_header <- paste("Bearer ", access_token, sep = "")

# Make the API request 
response <- GET(url, add_headers(Authorization = authorization_header, `User-Agent` = user_agent_string))

# Extract the response content and parse 
response_json <- rawToChar(response$content)

From here, it looks like each comment is stored between a pair of <p> and </p> tags:

  • <p>Reminds me of a chemistry professor I had in college, he taught a class on polymers (really smart guy, Nobel prize voter level). When talking about glass transition temperature he suddenly stopped and told a story about how a week or two beforehand he had put some styrofoam into the oven to keep the food warm while he waited for his wife to get home. It melted and that was his example on glass transition temperature. Basically: no matter how smart or trained you are, you can still make a mistake.</p>

  • <p>opening the butterfly valve on the bottom of a pressurized FV with a peanut butter chocolate milk stout in it. Made the inside of my freezer look like someone diarrhea'd all over the inside of the door.</p>

Using this logic, I tried to only keep text between these symbols via Regex:

final = response_json[1]
matches <- gregexpr("<p>(.*?)</p>", final)
matches_text <- regmatches(final, matches)[[1]]

I think this code partly worked – but many entries were returned that were not comments:

[212] "<p>Worst mistake was buying malt hops and yeast and letting it go stale.</p>"
[213] "<p>Posts are automatically archived after 6 months.</p>"

Can someone please show me a better way of doing this? How can I only extract the comment text and nothing else?

Thanks!

  • Note: I am not sure if this code will extract ALL comments on a post or just some comments, and if there is a way to change this.

2 Answers


  1. If you want to stick with regex, try a lookaround pattern such as (?<=<p>).*?(?=</p>), which returns only the text between the tags and not the tags themselves, e.g.,

    > s <- "<p>xxxxx</p> <p>xyyyyyyyyy</p> <p>zzzzzzzzzzzz</p>"
    
    > regmatches(s, gregexpr("(?<=<p>).*?(?=</p>)", s, perl = TRUE))[[1]]
    [1] "xxxxx"        "xyyyyyyyyy"   "zzzzzzzzzzzz"
    
  2. Assuming that the API response is in JSON format, you can use the jsonlite package in R to parse it into a data frame and then read the comment text straight from the relevant column. No regular expressions are needed.

    library(jsonlite)

    # API response in JSON format
    response <- '{"comments":[{"name":"John","email":"[email protected]","body":"This is a comment."},{"name":"Jane","email":"[email protected]","body":"Another comment."}]}'

    # Parse the JSON; the result is a list whose "comments" element is a data frame
    df <- jsonlite::fromJSON(response, simplifyDataFrame = TRUE)

    # Extract the comment text from the "body" column
    comments <- df$comments$body
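For the real Reddit response, a rough sketch of the same JSON approach could look like the code below. It assumes the request returns Reddit's JSON listing and not an HTML page (for instance by requesting the .json form of the post URL), and that the parsed result is a list of two listings: the post first, then the comment tree, where each comment has kind "t1", its text in data$body, and its replies nested in data$replies. The mock object and the helper names (make_comment, make_listing, extract_bodies) are made up for illustration; the mock stands in for what jsonlite::fromJSON(response_json, simplifyVector = FALSE) would be expected to return.

```r
# Hypothetical helpers that build a small stand-in for the parsed response.
make_comment <- function(body, replies = "") {
  list(kind = "t1", data = list(body = body, replies = replies))
}
make_listing <- function(children) {
  list(kind = "Listing", data = list(children = children))
}

# Assumed shape: listing 1 holds the post ("t3"), listing 2 the comment tree.
parsed <- list(
  make_listing(list(list(kind = "t3", data = list(title = "Worst mistake")))),
  make_listing(list(
    make_comment("Let my hops go stale.",
                 replies = make_listing(list(make_comment("Same here.")))),
    make_comment("Opened the wrong valve."),
    list(kind = "more", data = list(children = list("abc", "def")))
  ))
)

# Walk a listing depth-first and collect the body of every comment (kind "t1").
extract_bodies <- function(listing) {
  bodies <- character(0)
  for (child in listing$data$children) {
    if (!identical(child$kind, "t1")) next  # skip "more" stubs and other kinds
    bodies <- c(bodies, child$data$body)
    replies <- child$data$replies
    if (is.list(replies)) {                 # replies is "" when there are none
      bodies <- c(bodies, extract_bodies(replies))
    }
  }
  bodies
}

comments <- extract_bodies(parsed[[2]])
print(comments)
```

Entries of kind "more" are placeholders for comments that Reddit left out of the response, so seeing them means the result is not the full thread. The sketch simply skips them.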
    