I need to parse the style attribute from an HTML corpus (it contains many different HTML entries with many different style attributes).
An example of the HTML may be the following:
<span style="font-size:0.58em;font-family:'Times New Roman';">
<span style='font-size:0.58em;font-family:"Times New Roman";'>
So the style attribute content is some text wrapped in single (') or double (") quotes.
If the text starts with a single quote, it should be read until the closing single quote is met. If it starts with a double quote, it should be read until the closing double quote is met.
I have produced the following regex, which works quite well:
/style\s*=\s*(?:'|").+?(?:'|")/gmi
The problem is that my solution fails to check consistency between the opening quote and the closing quote, so it produces matches like:
style="font-size:0.58em;font-family:'Times New Roman' --missing--> ;"
style='font-size:0.58em;font-family:"Times New Roman" --missing--> ;'
Is there a way to check both cases with one regex, or is the only option to split the current regex into two regexes, one for single quotes and one for double quotes?
2 Answers
You can refer back to the previously matched quote to do it.
https://regex101.com/
Have you thought about using an HTML DOM parser such as node-html-parser?
You can solve it using a backreference.
Regex:
Check this result.
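Since the regex itself did not survive in this answer, here is a minimal sketch of what a backreference-based pattern could look like. It is an illustration, not necessarily the exact regex the answer proposed. The names `STYLE_ATTR` and `extractStyles` are made up for this example. It assumes each style value sits on a single line (`.` does not match newlines here) and that the value does not contain its own delimiter quote.

```javascript
// Hypothetical pattern, not taken from the original answer.
// Group 1 captures the opening quote (' or "); \1 then requires the
// very same character to close the value, so mixed quotes cannot pair up.
// Group 2 is the attribute content between the quotes.
const STYLE_ATTR = /style\s*=\s*(['"])(.*?)\1/gi;

// Hypothetical helper: returns the content of every style attribute found.
function extractStyles(html) {
  // matchAll needs the g flag and yields one match object per attribute.
  return Array.from(html.matchAll(STYLE_ATTR), (m) => m[2]);
}

const corpus = [
  `<span style="font-size:0.58em;font-family:'Times New Roman';">`,
  `<span style='font-size:0.58em;font-family:"Times New Roman";'>`,
].join("\n");

console.log(extractStyles(corpus));
// [ "font-size:0.58em;font-family:'Times New Roman';",
//   'font-size:0.58em;font-family:"Times New Roman";' ]
```

Because the closing delimiter is tied to whatever group 1 matched, one regex covers both the single-quote and the double-quote case. Note that this still treats the HTML as plain text, so it would also match `style=` appearing inside comments or other attributes such as `data-style`; a DOM parser avoids that.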