I have a chunk of HTML, extracted using BeautifulSoup, which contains an SVG path and a title.
I am wondering if there is a way to extract the title text and the coordinates from this chunk.
<g class="myclass"><title>mytitle</title><path d="M -5, 0 a 5,5 0 1,0 10,0 a 5,5 0 1,0 -10,0" fill="rgba(255,255,255, 0.1" stroke="rgba(10, 158, 117, 0.8)" stroke-width="3" transform="translate(178.00000000000003, 201)"></path></g>
Below is the function that returns the above chunk (rows):
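from bs4 import BeautifulSoup

def scrape_spatial_data(page):
    # page is a Playwright page; dump the rendered body to a string
    html = page.inner_html("body")
    soup = BeautifulSoup(html, "html.parser")
    # every <g class="myclass"> group, each holding a <title> and a <path>
    rows = soup.select("g.myclass")
    return rows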
Is there a cleaner way of extracting this using BeautifulSoup?
2 Answers
A reasonable approach to this would be to return a dictionary that is keyed on the attribute names (from the path tag) and the title.
I recommend using lxml (if available) as the HTML parser for enhanced performance.
Try this:
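A minimal sketch of that approach, assuming `bs4` is installed. The function name `extract_path_data` and its `parser` parameter are hypothetical; the default is the built-in `html.parser`, and `"lxml"` can be passed instead when that package is available.

```python
from bs4 import BeautifulSoup


def extract_path_data(html, parser="html.parser"):
    """Return one dict per g.myclass group: its title plus every <path> attribute."""
    soup = BeautifulSoup(html, parser)
    results = []
    for group in soup.select("g.myclass"):
        title = group.find("title")
        path = group.find("path")
        # Start with the title text (None if the group has no <title>).
        record = {"title": title.get_text(strip=True) if title else None}
        if path is not None:
            # Tag.attrs is a plain dict of attribute name -> value.
            record.update(path.attrs)
        results.append(record)
    return results
```

In the question's function, this would replace the `soup.select(...)` step, for example `return extract_path_data(page.inner_html("body"))`.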
For the data shown in the question this would reveal:
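Assuming the result is a single dictionary holding the title plus every attribute of the path tag, it would look roughly like this. The values are copied verbatim from the question's markup, including the unclosed `rgba(` in `fill`; the variable name `expected` is only for illustration.

```python
expected = {
    "title": "mytitle",
    "d": "M -5, 0 a 5,5 0 1,0 10,0 a 5,5 0 1,0 -10,0",
    "fill": "rgba(255,255,255, 0.1",
    "stroke": "rgba(10, 158, 117, 0.8)",
    "stroke-width": "3",
    "transform": "translate(178.00000000000003, 201)",
}
```

All values are strings, so the coordinates still need to be parsed out of `d` and `transform` if numbers are required.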
If you’re using Playwright, there’s no need to use BeautifulSoup as well. Dumping the live DOM to a string just to feed it into BeautifulSoup is an antipattern and prone to error. Playwright has a full suite of locators and selectors that work on the live DOM and even auto-wait for actionability before operation.
So, assuming you’re automating a single-page app, save yourself the dependency, extra syntax, performance hit, and potential bugs, and stick to locators throughout:
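A locator-only version might look like the sketch below. It assumes `page` is a Playwright sync-API page that has already navigated to the target, and it reuses the question's function name and `g.myclass` selector; the choice of which attributes to collect (`d` and `transform`) is an assumption.

```python
def scrape_spatial_data(page):
    """Collect title text and path attributes from each g.myclass group."""
    results = []
    # Locator.all() returns one locator per element currently matching.
    for group in page.locator("g.myclass").all():
        path = group.locator("path")
        results.append({
            "title": group.locator("title").text_content(),
            "d": path.get_attribute("d"),
            "transform": path.get_attribute("transform"),
        })
    return results
```

Note that `all()` does not wait for elements to appear, so if the groups are rendered late it may help to call something like `page.locator("g.myclass").first.wait_for()` beforehand.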
If you want to extract the values from the translate() arguments, you can use:
If you want all attributes from the path element, you can use JS in the browser natively:
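Both steps can be sketched as follows. The helper names `parse_translate` and `path_attributes` are hypothetical, and the regular expression assumes the `transform` value contains a single `translate()` with one or two numeric arguments (a missing second argument is treated as 0, as in SVG).

```python
import re

# Matches translate(tx) or translate(tx, ty), comma- or space-separated.
_TRANSLATE = re.compile(
    r"translate\(\s*([^\s,()]+)(?:[\s,]+([^\s,()]+))?\s*\)"
)


def parse_translate(transform):
    """Return (tx, ty) as floats, or None if there is no translate()."""
    match = _TRANSLATE.search(transform or "")
    if match is None:
        return None
    tx = float(match.group(1))
    ty = float(match.group(2)) if match.group(2) is not None else 0.0
    return tx, ty


def path_attributes(path_locator):
    """Return every attribute of the element as a dict, evaluated in the browser."""
    # Locator.evaluate runs the JS function with the matched element as argument.
    return path_locator.evaluate(
        "el => Object.fromEntries("
        "Array.from(el.attributes).map(a => [a.name, a.value]))"
    )
```

With a Playwright page, usage might be `parse_translate(page.locator("g.myclass path").first.get_attribute("transform"))` and `path_attributes(page.locator("g.myclass path").first)`.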
On the other hand, if you’re scraping a static HTML page, consider dropping Playwright in favor of requests + BeautifulSoup.