I need to parse the style attribute from an HTML corpus (it contains many different HTML entries with many different style attributes).

An example of the HTML is the following:

<span style="font-size:0.58em;font-family:'Times New Roman';">
<span style='font-size:0.58em;font-family:"Times New Roman";'>

So the style attribute content is some text wrapped in single (') or double (") quotes.
If the text starts with a single quote, it should be read until the closing single quote is met. If it starts with a double quote, it should be read until the closing double quote is met.

I have produced the following regex, which works quite well:

/style\s*=\s*(?:'|").+?(?:'|")/gmi

The problem is that my solution fails to check consistency between the opening quote and the closing quote, so it produces matches like:

style="font-size:0.58em;font-family:'Times New Roman'  --missing--> ;"
style='font-size:0.58em;font-family:"Times New Roman"  --missing--> ;'

Is there a way to check both cases with one regex, or is the only option to split the current regex into two regexes, one for single quotes and one for double quotes?

2 Answers


  1. You can use a backreference to the previously matched quote to do it.

    https://regex101.com/

    Usually referred to as a backreference, this will match a repeat of
    the text matched and captured by the capture group # (number)
    specified.

    To reduce ambiguity one may also use \g#, or \g{#} where # is a
    digit. /(.)\1/

    style\s*=\s*('|")(.+?)\1
    

    Have you considered using an HTML DOM parser such as node-html-parser?

  2. You can solve it using a backreference.

    Regex:

    style\s*=\s*(["'])(.*?)\1
    

    check this result.
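
As a minimal sketch of the backreference approach suggested in both answers, the JavaScript below extracts the style values from the two sample lines in the question. The helper name `extractStyles` and the `title` attribute in the second sample are assumptions made for illustration; they do not come from the original posts. The sketch also contrasts a greedy `(.*)` group with a lazy `(.*?)` one, since the greedy form can run past the closing quote when another attribute using the same quote character follows on the same line.

```javascript
// Hypothetical helper (not from the original answers): collect the value of
// every style attribute, whichever quote character delimits it.
function extractStyles(html) {
  // Group 1 captures the opening quote; \1 requires the same character to close.
  // The lazy .*? stops at the first occurrence of that closing quote.
  const pattern = /style\s*=\s*(["'])(.*?)\1/gi;
  return Array.from(html.matchAll(pattern), (m) => m[2]);
}

// The two sample lines from the question.
const sample = [
  `<span style="font-size:0.58em;font-family:'Times New Roman';">`,
  `<span style='font-size:0.58em;font-family:"Times New Roman";'>`,
].join("\n");

const styles = extractStyles(sample);
console.log(styles);

// Greedy versus lazy on a line with a second double-quoted attribute
// (the title attribute is an invented example).
const twoAttrs = `<span style="color:red" title="x">`;
const greedyValue = twoAttrs.match(/style\s*=\s*(["'])(.*)\1/i)[2];
const lazyValue = twoAttrs.match(/style\s*=\s*(["'])(.*?)\1/i)[2];
console.log(greedyValue); // runs on into the title attribute
console.log(lazyValue);   // stops at the style attribute's own closing quote
```

Because `.` does not match line breaks by default, this sketch assumes each style value sits on a single line; a value that spans several lines would need a different character class or the `s` flag.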
