I am trying to extract regex patterns from links in a Pandas table generated from the page.
The code generating the Pandas data frame is given below:
import pandas as pd
import re
url = 'https://www.espncricinfo.com/records/year/team-match-results/2005-2005/twenty20-internationals-3'
base_url = 'https://www.espncricinfo.com'
table = pd.read_html(url, extract_links = "body")[0]
table = table.apply(lambda col: [link[0] if link[1] is None else f'{base_url}{link[1]}' for link in col])
table
I want to extract the match-id from the links in the table. For each match, the match-id is the set of successive digits that succeed the ‘t20i-‘ pattern and end before the forward slash. Example:
For this match the match-id is 211048. The code which does this for a single match is given below:
scorecard_url = 'https://www.espncricinfo.com/series/australia-tour-of-new-zealand-2004-05-61407/new-zealand-vs-australia-only-t20i-211048/full-scorecard'
match_id = re.findall('t20i-(d*)/', scorecard_url)
match_id[0]
I want to do this for the table as a whole by using a derived column match-id that uses the Scorecard column. However I have not been able to.
I initially tried this simple command to do this:
table['match_id']= re.findall('t20i-(d*)/', table['Scorecard'])
table
I got a ‘TypeError: expected string or bytes-like object’ which made me think the links are not stored as strings which might be causing the issue.
I then tried:
table['match_id']= re.findall('t20i-(d*)/', str(table['Scorecard']))
table
This gave me a ‘ValueError: Length of values (0) does not match length of index (3)’ which I am not sure about.
I also tried to use the lambda function approach without success. I don’t mind using this either if it works.
2
Answers
You are close. This adds a new column with the match ID.
Output:
you can try below: using Vectorized
table["match"] = table["Scorecard"].str[1].str.extract("t20i-(d+)")