Extract SVG Elements and Title from HTML using BeautifulSoup

felix1k
June 9, 2024
2 Answers

I have a chunk of HTML, extracted with BeautifulSoup, which contains an SVG path and a title.

I am wondering if there is a way to extract the title text and the coordinates from this chunk.

<g class="myclass"><title>mytitle</title><path d="M -5, 0 a 5,5 0 1,0 10,0 a 5,5 0 1,0 -10,0" fill="rgba(255,255,255, 0.1" stroke="rgba(10, 158, 117, 0.8)" stroke-width="3" transform="translate(178.00000000000003, 201)"></path></g>

Below is the function that returns the above chunk (rows):

def scrape_spatial_data(page):
    html = page.inner_html("body")
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.select("g.myclass")
    return rows

Is there a cleaner way of extracting this using BeautifulSoup?

Answers

A reasonable approach is to return a dictionary keyed on the path tag's attribute names, plus the title.

I recommend using lxml (if available) as the HTML parser for better performance.

Try this:

from playwright.sync_api import Page
from bs4 import BeautifulSoup as BS
from functools import cache

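@cache
def get_parser():
    # Prefer lxml when it is installed; otherwise fall back to the built-in parser.
    try:
        import lxml  # noqa: F401 (imported only to check availability)
        return "lxml"
    except ImportError:
        return "html.parser"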

def scrape_spatial_data(page: Page) -> dict[str, str]:
    html = page.inner_html("body")
    soup = BS(html, get_parser())
    result = {}
    if row := soup.select_one("g.myclass"):
        if t := row.select_one("title"):
            result["title"] = t.text
        if p := row.select_one("path"):
            result.update(p.attrs)
    return result

For the data shown in the question, this would return:

{
  "title": "mytitle",
  "d": "M -5, 0 a 5,5 0 1,0 10,0 a 5,5 0 1,0 -10,0",
  "fill": "rgba(255,255,255, 0.1",
  "stroke": "rgba(10, 158, 117, 0.8)",
  "stroke-width": "3",
  "transform": "translate(178.00000000000003, 201)"
}
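
Note that select_one reads only the first matching group, whereas the select call in the question returns every match. If the page can contain several g.myclass groups, a minimal sketch along the same lines might return one dictionary per group. The function name extract_groups and the sample markup are illustrative assumptions, not part of the question, and the built-in html.parser is used so that only bs4 is required:

```python
from bs4 import BeautifulSoup


def extract_groups(html: str, selector: str = "g.myclass") -> list[dict]:
    """Return one dict per matching <g>: its title text plus its path attributes."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for group in soup.select(selector):
        record = {}
        title = group.select_one("title")
        if title is not None:
            record["title"] = title.get_text(strip=True)
        path = group.select_one("path")
        if path is not None:
            # attrs maps attribute names to their raw string values
            # (a class attribute, if present, would be a list instead).
            record.update(path.attrs)
        results.append(record)
    return results


# Hypothetical sample markup modelled on the snippet in the question.
SAMPLE = (
    "<svg>"
    '<g class="myclass"><title>first</title>'
    '<path d="M 0,0" transform="translate(1, 2)"></path></g>'
    '<g class="myclass"><title>second</title>'
    '<path d="M 0,0" transform="translate(3.5, 4)"></path></g>'
    "</svg>"
)

for record in extract_groups(SAMPLE):
    print(record["title"], record["transform"])
```

Groups without a title or path simply produce a dictionary with fewer keys, so callers should use .get() if those elements are optional.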

- ggorlen
- June 9, 2024 at 3:41 pm
If you’re using Playwright, there’s no need to use BeautifulSoup as well. Dumping the live DOM to a string just to feed it into BS is an antipattern and prone to error. Playwright has a full suite of locators and selectors that work on the live DOM and even auto-wait for actionability before acting.

So, assuming you’re automating a single page app, save yourself the dependency, extra syntax, performance hit, and potential bugs, and stick to locators throughout:
```
from playwright.sync_api import sync_playwright # 1.44.0


svg = """<Your example SVG from the question, verbatim>"""


with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.set_content(svg) # for demonstration, normally use goto
    title = page.locator(".myclass title").text_content()
    coordinates = page.locator(".myclass path").get_attribute("transform")
    print(title, coordinates)
    browser.close()
```
If you want to extract the values from the translate() arguments, you can use:
```
import re

x, y = [float(v) for v in re.findall(r"[\d.]+", coordinates)]
```
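A pattern that matches only digits and dots would drop a minus sign or an exponent, so negative or very small offsets would come out wrong. Here is a slightly stricter sketch, assuming the attribute contains a single translate() call with one or two numbers (SVG lets the second argument be omitted, in which case it defaults to 0). The helper name `parse_translate` is hypothetical:

```python
import re

# A signed decimal number with an optional exponent, e.g. -5, 178.0, 1e-3.
NUMBER = r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?"
# translate(x) or translate(x, y); the arguments may be separated by commas or spaces.
TRANSLATE = re.compile(rf"translate\(\s*({NUMBER})(?:[\s,]+({NUMBER}))?\s*\)")


def parse_translate(transform: str) -> tuple[float, float]:
    """Return (x, y) from the first translate(...) found in a transform string."""
    match = TRANSLATE.search(transform)
    if match is None:
        raise ValueError(f"no translate() found in {transform!r}")
    x = float(match.group(1))
    # A missing second argument means no vertical offset.
    y = float(match.group(2)) if match.group(2) is not None else 0.0
    return x, y


print(parse_translate("translate(178.00000000000003, 201)"))
```

This does not handle chained transforms such as several translate() calls or a rotate() that also moves the element; it only reads the first translate() it finds.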
If you want all attributes from the path element, you can use JS in the browser natively:
```
attrs = page.locator(".myclass path").evaluate("""
    el => Object.fromEntries([...el.attributes].map(e => [e.name, e.value]))
""")
```
On the other hand, if you’re scraping a static HTML page, consider dropping Playwright in favor of requests + BeautifulSoup.