Have read_html read *cell content* and *text bubble* separately, instead of concatenating them

OCa
October 21, 2023
142 views
1 vote
2 Answers

This page shows text bubbles when hovering over values in the "Score" and "XP LVL" columns.

It appears that read_html concatenates the cell content and the text bubble. Splitting them in post-processing is not always straightforward, so I am looking for a way to have read_html handle them separately, possibly returning them as two columns.

This is how the first row appears online:

(Rank)#	Name	Score	XP LVL	Victories / Total	Victory Ratio
1	Rainin☆☆☆☆	6129	447	408 / 531	76%

where "Score"'s "6129" carries the bubble "Max 6129"
where, more annoyingly, "XP LVL"'s "447" carries the bubble "21173534 pts"

This is how it appears after reading:

pd.read_html('https://stats.gladiabots.com/pantheon?', header=0, flavor="html5lib")[0]

        #            Name         Score           XP LVL Victories / Total  
0       1      Rainin☆☆☆☆  6129Max 6129  44721173534 pts         408 / 531

Note that "44721173534 pts" is the concatenation of "447" and "21173534 pts". "XP LVL" values have a variable number of digits, so splitting the string in post-processing would require some fairly clever logic, and I would like to explore having read_html do the split first.

(The special flavor="html5lib" was added because the page is dynamically-generated)

I have not found any mention of text bubbles in the docs

Answers

You can use BeautifulSoup to parse the page yourself and then build the DataFrame:

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://stats.gladiabots.com/pantheon"
soup = BeautifulSoup(requests.get(url).content, "html5lib")

all_data = []
for tr in soup.table.select("tr:has(td)"):
    all_data.append([])
    for td in tr.select("td"):
        all_data[-1].extend(td.get_text(strip=True, separator="###").split("###"))

df = pd.DataFrame(
    all_data, columns=["#", "Name", "Score", "Score2", "XP LVL", "PTS", "V/T", "Ratio"]
)
print(df.head())

Prints:

   #          Name Score    Score2 XP LVL           PTS          V/T Ratio
0  1    Rainin☆☆☆☆  6129  Max 6129    447  21173534 pts    408 / 531   76%
1  2      ZM_XL☆☆☆  5888  Max 6025    344  15942978 pts  3685 / 6748   54%
2  3   UzuraGames☆  5555  Max 5586    119   4688941 pts   610 / 1109   55%
3  4  Markolainen☆  5521  Max 5612    113   4433827 pts   763 / 1255   60%
4  5     Defunct☆☆  5337  Max 5452    225   9999855 pts  1535 / 3066   50%
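
To see why six table cells become eight columns, here is a small sketch of the flattening step on its own. The cell strings are hand-written for illustration, assuming they are what get_text(strip=True, separator="###") returns for the first row; no request is made.

```python
SEP = "###"

# Assumed per-cell text for the first row, as get_text(strip=True, separator=SEP)
# would return it: cells with a text bubble contain two pieces joined by SEP.
cells = [
    "1",
    "Rainin☆☆☆☆",
    "6129###Max 6129",
    "447###21173534 pts",
    "408 / 531",
    "76%",
]

row = []
for cell_text in cells:
    # split() yields one item for plain cells and two for cells with a bubble
    row.extend(cell_text.split(SEP))

columns = ["#", "Name", "Score", "Score2", "XP LVL", "PTS", "V/T", "Ratio"]
print(dict(zip(columns, row)))
```

This only lines up with the eight column names if every row splits into the same number of pieces; a row in which a cell lacks its bubble would come out shorter and its values would shift left.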

This happens because pandas reads each cell through the .text attribute of the <td> bs4.element.Tag object, and .text concatenates the text of all the tag's descendants without any separator.

In the first row of the table, the Score cell contains two pieces of text, 6129 and Max 6129, hence the concatenated 6129Max 6129.

<td nowrap="" class="barContainer">
  <div class="scoreBar" style="width: 100%;"></div>
  <div class="maxScoreBar" style="width: 0%;"></div>
  <span class="barLabel tooltipable">
    "6129"
    <span class="tooltip">
      "Max 6129"
    </span>
  </span>
</td>
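
The effect can be reproduced without pandas. The sketch below is only an illustration: it uses the standard library's html.parser (not the html5lib/bs4 stack that pandas uses) on a simplified copy of the cell above, and TextCollector is a hypothetical helper name. Joining the text nodes with no separator gives the fused string, while joining them with a separator keeps the two values apart.

```python
from html.parser import HTMLParser

# Simplified copy of the "Score" cell shown above (whitespace removed).
CELL = (
    '<td nowrap="" class="barContainer">'
    '<div class="scoreBar" style="width: 100%;"></div>'
    '<div class="maxScoreBar" style="width: 0%;"></div>'
    '<span class="barLabel tooltipable">6129'
    '<span class="tooltip">Max 6129</span>'
    '</span></td>'
)


class TextCollector(HTMLParser):
    """Hypothetical helper: collects every non-empty text node in order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


collector = TextCollector()
collector.feed(CELL)
collector.close()

fused = "".join(collector.chunks)       # join with no separator
separated = "_".join(collector.chunks)  # same nodes, joined with a separator
print(fused)
print(separated)
```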

A quick/hacky solution is to override the _text_getter method of the parser used by pandas, replacing .text with get_text, which accepts a separator parameter:

def _text_getter(self, obj):
    return obj.get_text(separator="_", strip=True)  # "_" is an arbitrary separator choice

pd.io.html._BeautifulSoupHtml5LibFrameParser._text_getter = _text_getter

With this modification, read_html returns this DataFrame:

        #            Name          Score            XP LVL Victories / Total Victory_Ratio
0       1      Rainin☆☆☆☆  6129_Max 6129  447_21173534 pts         408 / 531           76%
1       2        ZM_XL☆☆☆  5888_Max 6025  344_15942978 pts       3685 / 6748           54%
2       3     UzuraGames☆  5555_Max 5586   119_4688941 pts        610 / 1109           55%
..    ...             ...            ...               ...               ...           ...
997   998          Tekuma  3183_Max 3460     27_370585 pts         151 / 304           49%
998   999            hemi  3183_Max 3227      10_49432 pts           29 / 62           46%
999  1000  wanna bet kid?  3183_Max 3304      13_85777 pts           51 / 95           53%

[1000 rows x 6 columns]
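
Assigning to the class attribute changes every later read_html call in the same process. One way to limit that, sketched here with a hypothetical helper named patched_attr and a stand-in class FakeParser (not the real pandas parser), is to swap the method inside a context manager and restore it afterwards.

```python
from contextlib import contextmanager


@contextmanager
def patched_attr(owner, name, replacement):
    """Temporarily replace owner.<name>, restoring the original on exit."""
    original = getattr(owner, name)
    setattr(owner, name, replacement)
    try:
        yield
    finally:
        setattr(owner, name, original)


class FakeParser:
    """Stand-in for the pandas parser class; not the real one."""

    def _text_getter(self, obj):
        return "".join(obj)  # mimics a join with no separator


def text_getter_with_sep(self, obj):
    return "_".join(obj)  # same pieces, joined with "_"


cell = ["447", "21173534 pts"]
parser = FakeParser()

before = parser._text_getter(cell)
with patched_attr(FakeParser, "_text_getter", text_getter_with_sep):
    during = parser._text_getter(cell)
after = parser._text_getter(cell)

print(before, during, after, sep=" | ")
```

With pandas, the same helper would presumably be called as patched_attr(pd.io.html._BeautifulSoupHtml5LibFrameParser, "_text_getter", _text_getter) around the read_html call. That is untested here and relies on a private pandas name that may change between versions.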

With the separator in place, you can split the values of the two affected columns:

scores = df.pop("Score").str.extract(r"(?P<Score>\d+)_Max (?P<Max>\d+)")
xplvls = df.pop("XP LVL").str.extract(r"(?P<XPLVL>\d+)_(?P<PTS>\d+)")

out = pd.concat([df, scores, xplvls], axis=1)

Output :

print(out) # with only `scores` and `xplvls`

    Score   Max XPLVL       PTS
0    6129  6129   447  21173534
1    5888  6025   344  15942978
2    5555  5586   119   4688941
..    ...   ...   ...       ...
997  3183  3460    27    370585
998  3183  3227    10     49432
999  3183  3304    13     85777

[1000 rows x 4 columns]
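
The two patterns can be checked on their own with the standard re module before running them over a whole column. The sample strings below are copied from the table above. The values come back as strings, so converting them to integers is a separate step.

```python
import re

# Named groups become the column names when used with Series.str.extract.
SCORE_RE = re.compile(r"(?P<Score>\d+)_Max (?P<Max>\d+)")
XPLVL_RE = re.compile(r"(?P<XPLVL>\d+)_(?P<PTS>\d+)")

# (Score cell, XP LVL cell) pairs taken from rows 0 and 998 above.
samples = [
    ("6129_Max 6129", "447_21173534 pts"),
    ("3183_Max 3227", "10_49432 pts"),
]

rows = []
for score_cell, xp_cell in samples:
    row = {}
    row.update(SCORE_RE.match(score_cell).groupdict())
    row.update(XPLVL_RE.match(xp_cell).groupdict())
    rows.append(row)

for row in rows:
    print(row)
```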
