Html - Is there a way to extract regular expression patterns from links in a Pandas dataframe?

siddharthvarada
August 16, 2023
280 views
0 votes
2 Answers

I am trying to extract regex patterns from links in a Pandas table generated from the page.

The code generating the Pandas data frame is given below:

import pandas as pd
import re

url = 'https://www.espncricinfo.com/records/year/team-match-results/2005-2005/twenty20-internationals-3'
base_url = 'https://www.espncricinfo.com'

table = pd.read_html(url, extract_links = "body")[0]
table = table.apply(lambda col: [link[0] if link[1] is None else f'{base_url}{link[1]}' for link in  col])
table

I want to extract the match-id from the links in the table. For each match, the match-id is the set of successive digits that succeed the ‘t20i-‘ pattern and end before the forward slash. Example:
For this match the match-id is 211048. The code which does this for a single match is given below:

scorecard_url = 'https://www.espncricinfo.com/series/australia-tour-of-new-zealand-2004-05-61407/new-zealand-vs-australia-only-t20i-211048/full-scorecard'
match_id = re.findall('t20i-(d*)/', scorecard_url)
match_id[0]

I want to do this for the table as a whole by using a derived column match-id that uses the Scorecard column. However I have not been able to.

I initially tried this simple command to do this:

table['match_id']= re.findall('t20i-(d*)/', table['Scorecard'])
table

I got a ‘TypeError: expected string or bytes-like object’ which made me think the links are not stored as strings which might be causing the issue.

I then tried:

table['match_id']= re.findall('t20i-(d*)/', str(table['Scorecard']))
table

This gave me a ‘ValueError: Length of values (0) does not match length of index (3)’ which I am not sure about.

I also tried to use the lambda function approach without success. I don’t mind using this either if it works.

Answers

You are close. This adds a new column with the match ID.

import pandas as pd
import re

url = 'https://www.espncricinfo.com/records/year/team-match-results/2005-2005/twenty20-internationals-3'
base_url = 'https://www.espncricinfo.com'

def match(row):
    match_id = re.findall('t20i-(d*)/', row[1])
    return match_id[0]
    
table = pd.read_html(url, extract_links = "body")[0]
table['match'] = table['Scorecard'].apply(match)
print(table)

Output:

                 Team 1  ...   match
0   (New Zealand, None)  ...  211048
1       (England, None)  ...  211028
2  (South Africa, None)  ...  222678

[3 rows x 8 columns]

you can try below: using Vectorized
table["match"] = table["Scorecard"].str[1].str.extract("t20i-(d+)")

In [230]: table["match"] = table["Scorecard"].str[1].str.extract("t20i-(d+)")

In [231]: table
Out[231]:
                 Team 1               Team 2               Winner  ...            Match Date                                          Scorecard   match
0   (New Zealand, None)    (Australia, None)    (Australia, None)  ...  (Feb 17, 2005, None)  (T20I # 1, /series/australia-tour-of-new-zeala...  211048
1       (England, None)    (Australia, None)      (England, None)  ...  (Jun 13, 2005, None)  (T20I # 2, /series/australia-tour-of-england-a...  211028
2  (South Africa, None)  (New Zealand, None)  (New Zealand, None)  ...  (Oct 21, 2005, None)  (T20I # 3, /series/new-zealand-tour-of-south-a...  222678

Please signup or login to give your own answer.

Click here to cancel reply.

Html – Is there a way to extract regular expression patterns from links in a Pandas dataframe?

Answers