I am observing strange behavior while parsing text from an HTML file using Python regex. I would greatly appreciate suggestions on which regex I should use.
import re
string = "<a href='https://academia/course/3743'>3743</a>, <a href='https://academia/course/3963'>3963</a>, <a href='https://academia/course/3850'>3850</a>,"
# I want to extract 3743, 3963, 3850 from the above text
pattern = r".*?<a href='.*'>([0-9]+)</a>,.*"
result = re.findall(pattern, string)
print(result)
# Output
['3850']
It prints only the last occurrence and leaves out the rest. I tried following this as well, but it doesn’t help:
python findall finds only the last occurrence
Can anybody please help with the regex I should use to get all the numbers?
# expected output
[3743, 3963, 3850]
PS: I can’t use third-party Python modules like bs4. I need to stick with Python's built-in modules.
2 Answers
When looking for a pattern in a string using regex and findall, you can simply put the pattern you are looking for in the regex. There is no need to add '.*?' before and after the actual pattern. The main problem with your regex is the part href='.*' which matches any character in the href value as many times as possible. As a consequence, it will not stop at the first '> it encounters, but at the last one, giving you a single number as a result. You can see this behavior if you wrap the value in a group, as in href='(.*)' where the text actually captured by the group is:
https://academia/course/3743'>3743</a>, <a href='https://academia/course/3963'>3963</a>, <a href='https://academia/course/3850
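To see the greedy match for yourself, here is a small sketch using the string from the question. The variable name greedy is only illustrative.

```python
import re

string = "<a href='https://academia/course/3743'>3743</a>, <a href='https://academia/course/3963'>3963</a>, <a href='https://academia/course/3850'>3850</a>,"

# Greedy .* runs on to the LAST single quote in the string,
# so the whole run of links is swallowed by one match.
greedy = re.findall(r"href='(.*)'", string)
print(len(greedy))  # 1
print(greedy[0])    # starts at the first URL, ends with .../course/3850
```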
To prevent this, you must tell the regex to match as few characters as possible by writing href='.*?' instead. The question mark after the quantifier is what makes it lazy. It will then stop at the first point where the rest of the pattern can match, which here is the first closing quote.
The final code, including the regex, will be:
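A minimal sketch of that fix, applied to the string from the question. The int conversion at the end is an addition, assuming integers are wanted as in the expected output.

```python
import re

string = "<a href='https://academia/course/3743'>3743</a>, <a href='https://academia/course/3963'>3963</a>, <a href='https://academia/course/3850'>3850</a>,"

# Lazy .*? stops at the first closing quote, so each link is matched separately.
# No leading or trailing .* is needed: findall scans the whole string itself.
pattern = r"<a href='.*?'>([0-9]+)</a>"
result = re.findall(pattern, string)
print(result)  # ['3743', '3963', '3850']

# findall returns strings; convert them if integers are needed.
numbers = [int(n) for n in result]
print(numbers)  # [3743, 3963, 3850]
```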
You can use simple regex for the desired output.
Output: