This site page has text bubbles appearing when hovering over values in columns "Score" and "XP LVL".
It appears that read_html concatenates the cell content and the text bubble. Splitting them in post-processing is not always obvious, so I am looking for a way to have read_html handle them separately, possibly returning them as two columns.
This is how the first row appears online:
(Rank)# | Name | Score | XP LVL | Victories / Total | Victory Ratio |
---|---|---|---|---|---|
1 | Rainin☆☆☆☆ | 6129 | 447 | 408 / 531 | 76% |
- where "Score"'s "6129" carries bubble "Max 6129"
- where, more annoyingly, "XP LVL"'s "447" carries bubble "21173534 pts"
This is how it appears after reading:
pd.read_html('https://stats.gladiabots.com/pantheon?', header=0, flavor="html5lib")[0]
# Name Score XP LVL Victories / Total
0 1 Rainin☆☆☆☆ 6129Max 6129 44721173534 pts 408 / 531
See, "44721173534 pts" is the concatenation of "447" and "21173534 pts". "XP LVL" values have a variable number of digits, so splitting the string in the post-processing phase would require being pretty smart about it, and I would like to explore the "let read_html do the split" route first.
(The special flavor="html5lib" was added because the page is dynamically-generated)
I have not found any mention of text bubbles in the docs.
2 Answers
You can use beautifulsoup to parse the page yourself and then create the dataframe from the parsed cells.
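As a minimal sketch of that idea (not the original answer's code, which did not survive here): the HTML below is hypothetical markup standing in for the real page, whose structure is not shown in the question, and the "bubble" class and the "... bubble" column names are made up for illustration. It relies on stripped_strings yielding the visible value and the bubble text as separate strings, so they never get glued together.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup: the real page may nest the bubble text differently.
html = """
<table>
  <tr><th>#</th><th>Name</th><th>Score</th><th>XP LVL</th></tr>
  <tr>
    <td>1</td><td>Rainin</td>
    <td>6129<span class="bubble">Max 6129</span></td>
    <td>447<span class="bubble">21173534 pts</span></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find_all("tr")[1:]:  # skip the header row
    # One list of strings per cell: [value] or [value, bubble text]
    cells = [list(td.stripped_strings) for td in tr.find_all("td")]
    rank, name, score, xp = cells
    rows.append({
        "#": int(rank[0]),
        "Name": name[0],
        "Score": int(score[0]),
        "Score bubble": score[1],
        "XP LVL": int(xp[0]),
        "XP bubble": xp[1],
    })

bs_df = pd.DataFrame(rows)
print(bs_df)
```

For the live page you would fetch the HTML first and adapt the cell handling to whatever element actually holds the bubble text.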
It turns out that this happens because pandas uses the .text attribute of the <td> bs4.element.Tag objects, and this attribute concatenates (without any separator) the texts of all the tag's children. In the first row of the table, the score cell has two children, 6129 and Max 6129, hence the concatenation.
A quick/hacky solution would be to override the _text_getter method of the parser used by pandas and replace .text with get_text, which has a separator parameter.
With this modification, read_html returns a df in which each cell's value and its bubble text are kept apart by the chosen separator. From there, you can extract / detach the values of the two columns concerned.
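The override and the split described above can be sketched roughly as follows. Several things here are assumptions rather than anything stated in the question: the private class name _BeautifulSoupHtml5LibFrameParser (a pandas internal that may change between versions), the "|" separator, the inline sample markup, and the "... bubble" column names. It also needs html5lib and beautifulsoup4 installed.

```python
from io import StringIO

import pandas as pd
# Private pandas class (assumption: present under this name in your version).
from pandas.io.html import _BeautifulSoupHtml5LibFrameParser

SEP = "|"  # hypothetical separator, picked because it does not occur in the data


def _text_getter_with_separator(self, obj):
    # get_text() joins the tag's strings with SEP instead of gluing them together
    return obj.get_text(separator=SEP, strip=True)


# Monkeypatch: affects every later read_html call using the bs4/html5lib flavor.
_BeautifulSoupHtml5LibFrameParser._text_getter = _text_getter_with_separator

# Hypothetical markup standing in for the real page.
html = """
<table>
  <tr><th>#</th><th>Name</th><th>Score</th><th>XP LVL</th></tr>
  <tr>
    <td>1</td><td>Rainin</td>
    <td>6129<span class="bubble">Max 6129</span></td>
    <td>447<span class="bubble">21173534 pts</span></td>
  </tr>
</table>
"""

df = pd.read_html(StringIO(html), header=0, flavor="html5lib")[0]
raw = df.copy()  # keep the separator-joined strings for inspection

# Split each affected column once on the separator: value | bubble text
for col, bubble_col in [("Score", "Score bubble"), ("XP LVL", "XP bubble")]:
    parts = df[col].str.split(SEP, n=1, expand=True)
    df[col] = parts[0].astype(int)
    df[bubble_col] = parts[1]

print(df)
```

Because this patches a private method globally, it is worth keeping it isolated in the script that does the scraping, and checking it again after upgrading pandas.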