skip to Main Content

I am trying to extract regex patterns from links in a Pandas table generated from the page.

The code generating the Pandas data frame is given below:

import pandas as pd
import re

url = 'https://www.espncricinfo.com/records/year/team-match-results/2005-2005/twenty20-internationals-3'
base_url = 'https://www.espncricinfo.com'

table = pd.read_html(url, extract_links = "body")[0]
table = table.apply(lambda col: [link[0] if link[1] is None else f'{base_url}{link[1]}' for link in  col])
table

I want to extract the match-id from the links in the table. For each match, the match-id is the set of successive digits that succeed the ‘t20i-‘ pattern and end before the forward slash. Example:
For this match the match-id is 211048. The code which does this for a single match is given below:

scorecard_url = 'https://www.espncricinfo.com/series/australia-tour-of-new-zealand-2004-05-61407/new-zealand-vs-australia-only-t20i-211048/full-scorecard'
match_id = re.findall('t20i-(d*)/', scorecard_url)
match_id[0]

I want to do this for the table as a whole by using a derived column match-id that uses the Scorecard column. However I have not been able to.

I initially tried this simple command to do this:

table['match_id']= re.findall('t20i-(d*)/', table['Scorecard'])
table

I got a ‘TypeError: expected string or bytes-like object’ which made me think the links are not stored as strings which might be causing the issue.

I then tried:

table['match_id']= re.findall('t20i-(d*)/', str(table['Scorecard']))
table

This gave me a ‘ValueError: Length of values (0) does not match length of index (3)’ which I am not sure about.

I also tried to use the lambda function approach without success. I don’t mind using this either if it works.

2

Answers


  1. You are close. This adds a new column with the match ID.

    import pandas as pd
    import re
    
    url = 'https://www.espncricinfo.com/records/year/team-match-results/2005-2005/twenty20-internationals-3'
    base_url = 'https://www.espncricinfo.com'
    
    def match(row):
        match_id = re.findall('t20i-(d*)/', row[1])
        return match_id[0]
        
    table = pd.read_html(url, extract_links = "body")[0]
    table['match'] = table['Scorecard'].apply(match)
    print(table)
    

    Output:

                     Team 1  ...   match
    0   (New Zealand, None)  ...  211048
    1       (England, None)  ...  211028
    2  (South Africa, None)  ...  222678
    
    [3 rows x 8 columns]
    
    Login or Signup to reply.
  2. you can try below: using Vectorized
    table["match"] = table["Scorecard"].str[1].str.extract("t20i-(d+)")

    In [230]: table["match"] = table["Scorecard"].str[1].str.extract("t20i-(d+)")
    
    In [231]: table
    Out[231]:
                     Team 1               Team 2               Winner  ...            Match Date                                          Scorecard   match
    0   (New Zealand, None)    (Australia, None)    (Australia, None)  ...  (Feb 17, 2005, None)  (T20I # 1, /series/australia-tour-of-new-zeala...  211048
    1       (England, None)    (Australia, None)      (England, None)  ...  (Jun 13, 2005, None)  (T20I # 2, /series/australia-tour-of-england-a...  211028
    2  (South Africa, None)  (New Zealand, None)  (New Zealand, None)  ...  (Oct 21, 2005, None)  (T20I # 3, /series/new-zealand-tour-of-south-a...  222678
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search