Can I keep extra spaces when using pd.read_html() to convert HTML to a DataFrame?

RyukiAquino
May 26, 2023
270 views
3 votes
3 Answers

I tried to convert HTML to a DataFrame using pd.read_html, but it automatically collapses runs of multiple spaces within the text.

For example:

<table>
<tr>
   <td>35</td>
   <td>Juan  Cruz</td>
</tr>
<tr>
   <td>23</td>
   <td>Philip   Philips</td>
</tr>
</table>

Expected Output:

[  0             1
   35       Juan  Cruz
   23       Philip   Philips]

Actual Output:

[  0             1
   35        Juan Cruz
   23        Philip Philips]

Any recommendation to keep extra spaces?

I searched for an alternative that produces the expected output, but I can’t find one. I also tried changing each space to &nbsp; before converting, but it didn’t work, and converting every space into a placeholder token and back again takes a lot of time. Can I use a regex for this problem, or is there an alternative solution?

Answers

- Timeless
- May 26, 2023 at 2:15 pm
- 0 votes
I’m not sure whether you can tell pandas to keep the extra whitespace.

As a workaround, you can use BeautifulSoup:
```
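# pip install beautifulsoup4 html5lib
import pandas as pd
from bs4 import BeautifulSoup

# html is the HTML source as a string
soup = BeautifulSoup(html, "html5lib")

# get_text() returns each cell's text without collapsing inner whitespace
df = pd.DataFrame(
    [[cell.get_text() for cell in row.find_all("td")] for row in soup.find_all("tr")]
)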
```
Output:
```
print(df)

    0                 1
0  35        Juan  Cruz
1  23  Philip   Philips
```
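For a version that runs end to end, here is a minimal sketch that wraps the same idea in a function. The function name table_to_dataframe and the inline sample HTML are assumptions for illustration, and it uses the built-in "html.parser" backend so that html5lib is not required. It also picks up th cells, which the snippet above skips.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Sample input mirroring the table in the question (assumed for illustration)
html = """
<table>
<tr>
   <td>35</td>
   <td>Juan  Cruz</td>
</tr>
<tr>
   <td>23</td>
   <td>Philip   Philips</td>
</tr>
</table>
"""


def table_to_dataframe(html_text):
    """Build a DataFrame of strings from the first <table>, keeping inner whitespace."""
    soup = BeautifulSoup(html_text, "html.parser")
    table = soup.find("table")
    rows = [
        # strip() trims only the ends of each cell; spaces inside the text survive
        [cell.get_text().strip() for cell in row.find_all(["td", "th"])]
        for row in table.find_all("tr")
    ]
    return pd.DataFrame(rows)


df = table_to_dataframe(html)
print(df)
```

Every cell comes back as a string, so numeric columns need an explicit conversion afterwards, for example with pd.to_numeric.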

- Corralien
- May 26, 2023 at 2:24 pm
- 0 votes
You can override the private _remove_whitespace function in pandas.io.html so that it only strips leading and trailing whitespace:
```
import pandas as pd

def _remove_whitespace(s, regex=None):
    return s.strip()  # trim the ends only, keep inner whitespace

pd.io.html._remove_whitespace = _remove_whitespace  # patches a private helper
```
Output:
```
>>> pd.read_html(html)

[    0                 1
 0  35        Juan  Cruz
 1  23  Philip   Philips]
```
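Because the assignment above changes pandas for the rest of the session, one option is to scope the patch. The following is a minimal sketch, assuming pd.io.html._remove_whitespace keeps its current name and call signature (it is a private helper and may change between pandas versions). The context manager name keep_inner_whitespace is hypothetical.

```python
from contextlib import contextmanager

import pandas as pd


@contextmanager
def keep_inner_whitespace():
    """Temporarily replace pandas' private whitespace helper, then restore it."""
    original = pd.io.html._remove_whitespace

    def strip_only(s, regex=None):
        # Trim the ends but leave runs of spaces inside the text untouched
        return s.strip()

    pd.io.html._remove_whitespace = strip_only
    try:
        yield
    finally:
        # Always put the original helper back, even if parsing raised
        pd.io.html._remove_whitespace = original


# Inside the block the patched helper is active
with keep_inner_whitespace():
    inside = pd.io.html._remove_whitespace("  Juan  Cruz ")

# Outside the block the original pandas behavior is back
outside = pd.io.html._remove_whitespace("  Juan  Cruz ")
```

A pd.read_html() call placed inside the with block would go through the patched helper. Recent pandas versions expect literal HTML to be wrapped in io.StringIO rather than passed as a plain string.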

- ImYourDEV
- May 26, 2023 at 3:08 pm
- 0 votes
When converting an HTML table to a DataFrame using pd.read_html() in pandas, runs of whitespace inside the cell text are collapsed by default. If you want to control how whitespace is handled, you can try the following:
1. Option 1: Use regular expressions (regex) to normalize whitespace afterwards: After building the
  DataFrame with 'pd.read_html()' you can use regular expressions to replace runs of whitespace with a single space. Note that this cannot restore spaces that 'pd.read_html()' has already removed. For example:
```
# Assuming df is the DataFrame obtained from pd.read_html()
# Collapse each run of two or more whitespace characters into one space
df = df.replace(r'\s{2,}', ' ', regex=True)
```
2. Option 2: Parse the HTML yourself:
  Instead of relying on 'pd.read_html()' to extract the cell text, you can write a custom function using a separate HTML parsing library (like BeautifulSoup), which leaves the whitespace inside each cell intact. Here is an example:
```
from bs4 import BeautifulSoup
import pandas as pd

# Assuming html_content is the HTML content as a string
soup = BeautifulSoup(html_content, 'html.parser')
table = soup.find('table')

# Extract the table data and preserve spaces
rows = []
for row in table.find_all('tr'):
    rows.append([cell.get_text() for cell in row.find_all('td')])
df = pd.DataFrame(rows)

# Optional: collapse runs of whitespace (skip this step to keep the extra spaces)
df = df.replace(r'\s{2,}', ' ', regex=True)
```
Option 2 retains the extra spaces in your DataFrame as long as you skip the optional replacement at the end; Option 1 only normalizes whitespace after the fact. Choose the one that suits your needs and integrate it into your code accordingly.
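To make the effect of the regex step concrete, here is a small sketch on a hand-built DataFrame (the sample values are assumed from the question). It suggests that replacing the pattern \s{2,} with a single space collapses runs of whitespace, so the step is useful for normalizing text and not for preserving it.

```python
import pandas as pd

# Hand-built frame standing in for a result that still contains extra spaces
df = pd.DataFrame({0: ["35", "23"], 1: ["Juan  Cruz", "Philip   Philips"]})

# \s{2,} matches two or more consecutive whitespace characters
collapsed = df.replace(r"\s{2,}", " ", regex=True)

print(collapsed)
```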
