How to extract the value of an apparently non-standard HTML tag in R

IvanBezerraAllaman
September 19, 2023
2 Answers

I have the following simplified HTML (html_file.html).

<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<div class="listing-wrapper__content">
<section class="card__amenities ">
<p class="l-text l-u-color-neutral-28 l-text--variant-body-small l-text--weight-regular card__amenity" itemprop="floorSize"><span data-testid="l-icon" role="document" aria-label="Tamanho do imóvel" class="l-icon l-u-color-undefined"><svg viewBox="0 0 24 24" fill="none" xmlns="http://www.w3.org/2000/svg">...</svg></span> 94 - 100 m² </p>
<p class="l-text l-u-color-neutral-28 l-text--variant-body-small l-text--weight-regular card__amenity" itemprop="numberOfRooms"><span data-testid="l-icon" role="document" aria-label="Quantidade de quartos" class="l-icon l-u-color-undefined"><svg viewBox="0 0 24 24" fill="none" xmlns="http://www.w3.org/2000/svg">...</svg></span> 3 </p>
<p class="l-text l-u-color-neutral-28 l-text--variant-body-small l-text--weight-regular card__amenity" itemprop="numberOfBathroomsTotal"><span data-testid="l-icon" role="document" aria-label="Quantidade de banheiros" class="l-icon l-u-color-undefined"><svg viewBox="0 0 24 24" fill="none" xmlns="http://www.w3.org/2000/svg">...</svg></span>3</p>
<p class="l-text l-u-color-neutral-28 l-text--variant-body-small l-text--weight-regular card__amenity"><span data-testid="l-icon" role="document" aria-label="Quantidade de vagas de garagem" class="l-icon l-u-color-undefined"><svg viewBox="0 0 24 24" fill="none" xmlns="http://www.w3.org/2000/svg">...</svg></span>2</p>
</section>
</div>
</body>
</html>

I managed to extract the first three elements. For example:

library(rvest)
pagee <- read_html("html_file.html")
nofrooms <- html_elements(pagee, ".listing-wrapper__content") %>%
  html_nodes("[itemprop='numberOfRooms']") %>%
  html_text()
nofrooms

Output is

" 3 "

The problem is the last p tag. It has no itemprop attribute, so there is no obvious selector I can use to extract its value. I have tried the following without success:

nofgarage <- html_elements(pagee, ".listing-wrapper__content") %>%
  html_nodes("[aria-label='Quantidade de vagas de garagem']") %>%
  html_text()
nofgarage

Output is

""

The result is empty, as expected, because the value I want to extract is not inside the span tags; it is a text node of the enclosing p tag.

Thanks for any help

Tags: html r rvest web-scraping

Answers

Going by your example code, and assuming that you only want the number at the end of each p tag, we could use a workaround with the xpath argument: select every text node whose parent is not an <svg> tag, then use purrr::discard to drop all empty strings:

library(rvest)
library(purrr)

html |> 
  read_html() |> 
  html_elements("p") |>
  html_nodes(xpath='//*[not(name()="svg")]/text()') |> 
  html_text(trim=TRUE) |> 
  purrr::discard(\(x) x == "")
#> [1] "94 - 100 m²" "3"           "3"           "2"

Data from OP

html <- '<section class="card__amenities ">
<p class="l-text l-u-color-neutral-28 l-text--variant-body-small l-text--weight-regular card__amenity" itemprop="floorSize"><span data-testid="l-icon" role="document" aria-label="Tamanho do imóvel" class="l-icon l-u-color-undefined"><svg viewBox="0 0 24 24" fill="none" xmlns="http://www.w3.org/2000/svg">...</svg></span> 94 - 100 m² </p>
<p class="l-text l-u-color-neutral-28 l-text--variant-body-small l-text--weight-regular card__amenity" itemprop="numberOfRooms"><span data-testid="l-icon" role="document" aria-label="Quantidade de quartos" class="l-icon l-u-color-undefined"><svg viewBox="0 0 24 24" fill="none" xmlns="http://www.w3.org/2000/svg">...</svg></span> 3 </p>
<p class="l-text l-u-color-neutral-28 l-text--variant-body-small l-text--weight-regular card__amenity" itemprop="numberOfBathroomsTotal"><span data-testid="l-icon" role="document" aria-label="Quantidade de banheiros" class="l-icon l-u-color-undefined"><svg viewBox="0 0 24 24" fill="none" xmlns="http://www.w3.org/2000/svg">...</svg></span>3</p>
<p class="l-text l-u-color-neutral-28 l-text--variant-body-small l-text--weight-regular card__amenity"><span data-testid="l-icon" role="document" aria-label="Quantidade de vagas de garagem" class="l-icon l-u-color-undefined"><svg viewBox="0 0 24 24" fill="none" xmlns="http://www.w3.org/2000/svg">...</svg></span>2</p>
</section>'

Created on 2023-09-15 with reprex v2.0.2
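
If only the garage count is needed, a more targeted variant of the XPath approach above may be enough: select the p tag whose child span carries the aria-label, then take that p tag's own text nodes. This is only a sketch; sample_html is a hypothetical, shortened stand-in for the OP's markup, and it assumes the aria-label text stays exactly as shown in the question.

```r
library(rvest)

# Hypothetical two-item sample mirroring the structure in the question
sample_html <- '<section class="card__amenities">
<p class="card__amenity" itemprop="numberOfRooms"><span aria-label="Quantidade de quartos"><svg>...</svg></span> 3 </p>
<p class="card__amenity"><span aria-label="Quantidade de vagas de garagem"><svg>...</svg></span>2</p>
</section>'

garage <- read_html(sample_html) |>
  # pick the <p> that has a child <span> with the garage aria-label,
  # then keep only the text nodes that are direct children of that <p>
  html_elements(xpath = '//p[span[@aria-label="Quantidade de vagas de garagem"]]/text()') |>
  html_text(trim = TRUE)

garage
```

Because the predicate tests the span but the selection returns the text of the parent p, the placeholder text inside the svg is never picked up.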

Since most listings appear to have four amenities, one could use the xml_child() function from xml2 to select the fourth one.
A few listings are missing the fourth amenity, so we need to filter before attempting to extract it.
See the comments in the code below.

library(rvest)
library(xml2)
library(dplyr)

url <- "https://www.zapimoveis.com.br/venda/apartamentos/ms+campo-grande/?transacao=venda&onde=,Mato%20Grosso%20do%20Sul,Campo%20Grande,,,,,city,BR%3EMato%20Grosso%20do%20Sul%3ENULL%3ECampo%20Grande,-20.464852,-54.621848,&tipos=apartamento_residencial&pagina=1"

# read the page
pagee <- read_html(url)

# get the amenities section from each listing
sections <- html_elements(pagee, "section.card__amenities")
# sections %>% html_elements("p") %>% html_text()

# create an empty vector, one entry per listing
garages <- vector("numeric", length = length(sections))

# retrieve the text of the 4th child node; not all listings have 4 amenities, hence the filter
# (listings without a 4th amenity keep the initial 0, which becomes "0" once character values are assigned)
has_four <- xml_length(sections) == 4
garages[has_four] <- sapply(sections[has_four], function(node) {
  xml_child(node, 4) %>% html_text()
})

# print the final vector
garages
# [1] "2" "4" "1" "1" "1" "1" "0" "2" "2" "2" "3" "1" "1" "1" "0"
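
The filtering step can also be tried offline. The sketch below is a variation on the code above, not the same code: sample_html is a hypothetical two-listing snippet, and the vector starts as NA rather than 0 so that a missing fourth amenity cannot be mistaken for a real count of zero.

```r
library(rvest)
library(xml2)

# Hypothetical sample: the second listing lacks the 4th (garage) amenity
sample_html <- '<div>
<section class="card__amenities"><p>94 m²</p><p>3</p><p>3</p><p>2</p></section>
<section class="card__amenities"><p>60 m²</p><p>2</p><p>1</p></section>
</div>'

sections <- html_elements(read_html(sample_html), "section.card__amenities")

# Start from NA so a missing amenity is distinguishable from a real "0"
garages <- rep(NA_character_, length(sections))

# Only sections with exactly 4 child elements have a 4th amenity to read
has_four <- xml_length(sections) == 4
garages[has_four] <- vapply(
  sections[has_four],
  function(node) html_text(xml_child(node, 4), trim = TRUE),
  character(1)
)

garages
```

This still relies on the garage count always being the fourth child, which is an assumption about the site's layout rather than something the markup guarantees.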
