
I tried to convert an HTML table to a DataFrame using pd.read_html, but it automatically collapses multiple consecutive spaces inside the text into a single space.

For Example:

<table>
<tr>
   <td>35</td>
   <td>Juan  Cruz</td>
</tr>
<tr>
   <td>23</td>
   <td>Philip   Philips</td>
</tr>
</table>

Expected Output:

[  0             1
   35       Juan  Cruz
   23       Philip   Philips]

Actual Output:

[  0             1
   35        Juan Cruz
   23        Philip Philips]

Is there a way to keep the extra spaces?

I searched for an alternative that produces the expected output, but I couldn't find one. I also tried replacing each space with &nbsp; before converting, but that didn't work, and replacing every space with a placeholder token and converting it back afterwards takes a lot of time. Could a regex solve this problem, or is there another solution?

3 Answers


  1. I'm not sure whether you can tell pandas to keep that extra whitespace.

    As a workaround, you can parse the table with BeautifulSoup:

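    # pip install beautifulsoup4 html5lib
    import pandas as pd
    from bs4 import BeautifulSoup

    # html is the table markup as a string
    soup = BeautifulSoup(html, "html5lib")

    # get_text() returns each cell's text with its whitespace intact
    df = pd.DataFrame(
        [[cell.get_text() for cell in row.find_all("td")] for row in soup.find_all("tr")]
    )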
    


    Output:

    print(df)
    
        0                 1
    0  35        Juan  Cruz
    1  23  Philip   Philips
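
If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's html.parser module. This is only a minimal sketch: the CellCollector class and table_rows helper are hypothetical names introduced here, and the code assumes a simple, non-nested table without colspan or rowspan.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td>/<th> cell, row by row, leaving whitespace untouched."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []      # finished rows
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_rows(html):
    """Return the table cells as a list of rows (hypothetical helper)."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return parser.rows


sample = """<table>
<tr><td>35</td><td>Juan  Cruz</td></tr>
<tr><td>23</td><td>Philip   Philips</td></tr>
</table>"""

rows = table_rows(sample)
```

The resulting list of rows can then be passed to pd.DataFrame(rows), as in the BeautifulSoup version.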
    
  2. You can override the private _remove_whitespace function in pd.io.html, which read_html applies to the text of each cell:

    import pandas as pd

    def _remove_whitespace(s, regex=None):
        return s.strip()  # trim the ends only; keep internal spaces
    pd.io.html._remove_whitespace = _remove_whitespace  # patches a private helper
    

    Output:

    >>> pd.read_html(html)
    
    [    0                 1
     0  35        Juan  Cruz
     1  23  Philip   Philips]
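
Because this replaces a private helper for the whole process, every later read_html call is affected too. One way to limit that, sketched here under assumptions, is to apply the override only temporarily. The temporarily_replace and strip_only names are hypothetical, and the demo uses a stand-in object rather than pandas itself so that it runs without any third-party package.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporarily_replace(owner, name, replacement):
    """Swap owner.<name> for replacement, and restore the original on exit."""
    original = getattr(owner, name)
    setattr(owner, name, replacement)
    try:
        yield
    finally:
        setattr(owner, name, original)


def strip_only(s, regex=None):
    # Same idea as the override above: trim the ends, keep internal spaces
    return s.strip()


# Stand-in for a module such as pd.io.html (assumption, for the demo only)
fake_module = SimpleNamespace(
    _remove_whitespace=lambda s, regex=None: " ".join(s.split())
)

with temporarily_replace(fake_module, "_remove_whitespace", strip_only):
    inside = fake_module._remove_whitespace("  Juan  Cruz ")   # override active
outside = fake_module._remove_whitespace("  Juan  Cruz ")      # original restored

# With pandas, the intended usage would look like:
# with temporarily_replace(pd.io.html, "_remove_whitespace", strip_only):
#     tables = pd.read_html(html)
```

Since _remove_whitespace is not part of the public pandas API, it may change or disappear between versions. Recent pandas versions may also warn when read_html is given a literal HTML string, in which case wrapping it in io.StringIO should help.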
    
  3. When converting an HTML table to a DataFrame using pd.read_html() in pandas, runs of whitespace inside the text are collapsed by default. If you want to control how the spaces are handled, you can try the following:

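    1. Option 1: Use regular expressions (regex) to normalize whitespace: After building the
      DataFrame with 'pd.read_html()' you can use a regular expression to replace runs of whitespace with a single space. Note that this only cleans up the text; it cannot restore spaces that 'pd.read_html()' has already removed. For example:

      # Assuming df is the DataFrame obtained from pd.read_html()
      # \s{2,} matches two or more consecutive whitespace characters
      df = df.replace(r'\s{2,}', ' ', regex=True)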
      
    2. Option 2: Use a custom function to parse the HTML:
      Instead of relying solely on 'pd.read_html()' to preserve the spacing, you can write a custom function that uses a different HTML library (like BeautifulSoup) to extract the data. Here is an example:

      from bs4 import BeautifulSoup
      import pandas as pd
      
      # Assuming html_content is the HTML content as a string
      soup = BeautifulSoup(html_content, 'html.parser')
      table = soup.find('table')
      
      # Extract the table data and preserve spaces
      rows = []
      for row in table.find_all('tr'):
          rows.append([cell.get_text() for cell in row.find_all('td')])
      df = pd.DataFrame(rows)
      
      # Optional: uncomment only if you later want to collapse the extra spaces again
      # df = df.replace(r'\s{2,}', ' ', regex=True)
      

    Option 2 retains the extra spaces in your DataFrame as long as you skip the optional replace step, while Option 1 only normalizes whitespace after the fact. Choose the one that suits your needs and integrate it into your code accordingly.
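
To see why a regex applied after parsing cannot bring the spaces back, and why non-breaking spaces are likely to be collapsed as well, here is a small sketch. It assumes pandas normalizes cell text with a pattern equivalent to \s+; the collapse helper is a hypothetical stand-in, not the pandas implementation.

```python
import re

# Assumption: stands in for the whitespace pattern pandas applies to each cell
WHITESPACE = re.compile(r"\s+")


def collapse(text):
    """Trim the ends and squeeze every run of whitespace into one space."""
    return WHITESPACE.sub(" ", text.strip())


a = collapse("Juan  Cruz")          # two regular spaces
b = collapse("Juan Cruz")           # one regular space
c = collapse("Juan\xa0\xa0Cruz")    # two non-breaking spaces (what &nbsp; decodes to)

# All three inputs end up identical, so the original spacing is unrecoverable
same = a == b == c
```

In Python, \s also matches the non-breaking space for str patterns, which would explain why swapping the spaces for another whitespace character before calling pd.read_html() does not help.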
