
I am seeing strange behavior while parsing text from an HTML file with Python's re module. I would greatly appreciate suggestions on which regex to use.

import re

string = "<a href='https://academia/course/3743'>3743</a>, <a href='https://academia/course/3963'>3963</a>,    <a href='https://academia/course/3850'>3850</a>,"
# I want to extract 3743, 3963, 3850 from the above text
pattern = r".*?<a href='.*'>([0-9]+)</a>,.*"
result = re.findall(pattern, string)
print(result)

# Output
['3850']

It prints only the last occurrence and leaves out the rest. I also tried the suggestions in the following question, but they didn't help:
python findall finds only the last occurrence

Can anybody please help with the regex I should use to get all the numbers?

# expected output
[3743, 3963, 3850]

PS: I can't use third-party modules like bs4. I need to stick with the Python standard library.

2 Answers


  1. When searching a string with re.findall, you only need to write the pattern you are looking for. There is no need to add a leading '.*?' or a trailing '.*' around the actual pattern.

    The main problem with your regex is href='.*'. The '.*' is greedy, so it matches as many characters as possible. As a consequence, it does not stop at the first '> it encounters but at the last one, which leaves you with a single number. You can see this behavior by wrapping the value in a group, href='(.*)'. The text actually captured by that group is https://academia/course/3743'>3743</a>, <a href='https://academia/course/3963'>3963</a>,    <a href='https://academia/course/3850

    To prevent this, tell the regex to match as few characters as possible with href='.*?'. The question mark after the quantifier makes it lazy (non-greedy), so it stops at the first point where the rest of the pattern can match, which here is the first '>.

    The final code, including the regex, is:

    import re

    string = "<a href='https://academia/course/3743'>3743</a>, <a href='https://academia/course/3963'>3963</a>,    <a href='https://academia/course/3850'>3850</a>,"
    pattern = r"<a href='.*?'>([0-9]+)</a>"
    result = re.findall(pattern, string)
    print(result)
    
    # Output
    ['3743', '3963', '3850']
    
  2. You can use a simple regex to get the desired output.

    import re
    
    string = "<a href='https://academia/course/3743'>3743</a>, <a href='https://academia/course/3963'>3963</a>,    <a href='https://academia/course/3850'>3850</a>,"
    
    # [^']* matches everything up to the closing quote; \d+ captures the digits
    pattern = r"<a href='[^']*'>(\d+)</a>"
    result = re.findall(pattern, string)
    
    print(result)
    

    Output:

    ['3743', '3963', '3850']
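
To compare the approaches from both answers side by side, here is a minimal sketch that uses only the standard re module. The variable names (html, greedy, lazy, negated, course_ids) are illustrative and not from the original posts. It also converts the matches to integers, since re.findall returns strings while the question's expected output shows numbers.

```python
import re

# Sample input copied from the question.
html = (
    "<a href='https://academia/course/3743'>3743</a>, "
    "<a href='https://academia/course/3963'>3963</a>,    "
    "<a href='https://academia/course/3850'>3850</a>,"
)

# Greedy: '.*' runs to the end of the string and backtracks only as far as
# needed, so the href group swallows the first two links entirely.
greedy = re.findall(r"<a href='(.*)'>([0-9]+)</a>", html)

# Lazy: '.*?' expands one character at a time and stops at the first "'>"
# that lets the rest of the pattern match.
lazy = re.findall(r"<a href='.*?'>([0-9]+)</a>", html)

# Negated class: "[^']*" cannot cross a quote, so each match stays inside
# a single href value without relying on lazy matching.
negated = re.findall(r"<a href='[^']*'>(\d+)</a>", html)

# findall returns strings; convert them if integers are needed.
course_ids = [int(n) for n in negated]

print(len(greedy), greedy[0][1])  # 1 3850
print(lazy)                       # ['3743', '3963', '3850']
print(course_ids)                 # [3743, 3963, 3850]
```

On this input the lazy and negated-class patterns give the same result. The negated class is usually the more defensive choice, because it can never run past the closing quote of the attribute.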
    