I tried to convert an HTML table to a DataFrame using pd.read_html, but it automatically collapses runs of multiple spaces within the text.
For example:
<table>
<tr>
<td>35</td>
<td>Juan    Cruz</td>
</tr>
<tr>
<td>23</td>
<td>Philip    Philips</td>
</tr>
</table>
Expected Output:
[ 0 1
35 Juan    Cruz
23 Philip    Philips]
Actual Output:
[ 0 1
35 Juan Cruz
23 Philip Philips]
Any recommendation for keeping the extra spaces?
I searched for an alternative that produces the same output, but I can’t find one. I also tried replacing the spaces before converting, but that didn’t work, and replacing every space with a placeholder token and converting it back afterwards takes a lot of time. Can I use a regex for this problem, or is there an alternative solution?
3 Answers
I’m not sure if you can tell pandas to keep those extra whitespaces.
As a workaround, you can use BeautifulSoup to extract the cell text and build the DataFrame from it.
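A minimal sketch of the BeautifulSoup workaround, assuming the beautifulsoup4 package is installed and the table contains only plain td cells (no header row, no colspan or rowspan). The sample markup and the variable names are illustrative, not taken from the original answer:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Sample markup (illustrative); the names contain runs of several spaces.
html = """
<table>
<tr><td>35</td><td>Juan    Cruz</td></tr>
<tr><td>23</td><td>Philip    Philips</td></tr>
</table>
"""

# "html.parser" is the parser bundled with Python, so no extra parser is needed.
soup = BeautifulSoup(html, "html.parser")

# Collect the raw text of every cell, row by row. get_text() returns the
# text as written, without collapsing the spaces inside it.
rows = [[td.get_text() for td in tr.find_all("td")] for tr in soup.find_all("tr")]

df = pd.DataFrame(rows)
print(df.values.tolist())
```

Every value comes back as a string with this approach, so numeric columns would still need an explicit conversion, for example with pd.to_numeric.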
You can override the _remove_whitespace function in pandas.io.html so that it returns the cell text unchanged.
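One way this override might look, as a sketch only. _remove_whitespace is a private helper in pandas.io.html, so it is not part of the public API and may change or disappear between pandas versions. The wrapper name read_html_keep_whitespace is hypothetical, and pd.read_html additionally needs an HTML parser such as lxml installed:

```python
import io

import pandas as pd
import pandas.io.html as pd_html  # private internals: may change between versions


def read_html_keep_whitespace(html):
    """Hypothetical helper: call pd.read_html with whitespace collapsing disabled."""
    original = pd_html._remove_whitespace
    # Swap the internal helper for a no-op that returns the text unchanged.
    pd_html._remove_whitespace = lambda s, *args, **kwargs: s
    try:
        return pd.read_html(io.StringIO(html))
    finally:
        # Always restore the original so later read_html calls behave normally.
        pd_html._remove_whitespace = original


html = """
<table>
<tr><td>35</td><td>Juan    Cruz</td></tr>
<tr><td>23</td><td>Philip    Philips</td></tr>
</table>
"""

tables = read_html_keep_whitespace(html)
print(repr(tables[0].iloc[0, 1]))
```

Because the no-op also skips the usual stripping of leading and trailing whitespace, cells that are indented in the markup may keep that padding. Restoring the original function in the finally block keeps the patch from leaking into unrelated code.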
When converting an HTML table to a DataFrame with pd.read_html(), runs of whitespace inside the cell text are collapsed by default. If you want to keep the extra spaces, you can try one of the following:
Option 1: Use regular expressions (regex) to preserve the spaces. Replace the runs of spaces with a placeholder pattern before parsing, and after creating the DataFrame with pd.read_html(), use a regular expression to replace that pattern with spaces again.
Option 2: Use a custom function to parse the HTML. Instead of relying solely on pd.read_html() to preserve the spacing, you can write a custom function that uses a different HTML library (such as BeautifulSoup) to extract the data.
These options allow you to retain the extra spaces in your DataFrame. Choose the one that suits your needs and integrate it into your code accordingly.
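Option 1 could be sketched roughly as follows. This is an assumption-laden illustration rather than a definitive implementation: the placeholder token "@@SP@@" and the helper name read_html_with_placeholder are hypothetical, the token is assumed never to occur in the real data, and the pattern only protects runs of spaces that sit between visible characters (it would also touch such runs inside tag attributes, and it does not restore placeholders in header cells). pd.read_html needs an HTML parser such as lxml installed:

```python
import io
import re

import pandas as pd

# Hypothetical placeholder token, assumed not to occur anywhere in the data.
PLACEHOLDER = "@@SP@@"

# Runs of two or more spaces between visible characters. The lookarounds
# skip indentation and any spacing directly next to a tag.
SPACE_RUN = re.compile(r"(?<=[^\s<>]) {2,}(?=[^\s<>])")


def read_html_with_placeholder(html):
    # Step 1: replace every space in a run with the placeholder before parsing.
    protected = SPACE_RUN.sub(lambda m: PLACEHOLDER * len(m.group()), html)
    tables = pd.read_html(io.StringIO(protected))
    # Step 2: turn the placeholders back into spaces in the parsed tables.
    pattern = re.escape(PLACEHOLDER)
    return [table.replace(pattern, " ", regex=True) for table in tables]


html = """
<table>
<tr><td>35</td><td>Juan    Cruz</td></tr>
<tr><td>23</td><td>Philip    Philips</td></tr>
</table>
"""

tables = read_html_with_placeholder(html)
print(repr(tables[0].iloc[0, 1]))
```

This makes two extra passes over the data (one over the markup, one over the DataFrame), which is the cost the question mentions for large inputs.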