
I have a DataFrame with URLs and a blacklist of words that I use to filter them.
Now I want the filter to look at each URL only up to the third occurrence of /.
So for example:

http://example.com/abc/def/

Here I would like to filter only up to the third occurrence of /.

So just:
http://example.com/

I read some similar questions and I guess I need to combine two regexes.

  1. /.*?/(.*?)/ should do the job of matching up to the third occurrence of /

  2. To filter for a list of words I use this expression:

mask = df["url"].str.contains(r'\b(?:{})\b'.format('|'.join(blacklist)))
df_new = df[~mask]

Now I don’t know how to combine these two expressions. I’m new to Python and especially regex so there also might be a smarter way of doing this task.

Thank you.

EDIT:
Blacklist looks like this: ["ebay","shop","camping","car"]

Df like this:

url                              text
http://example.com/abc/def/      fdogjdfgfd
http://abcde.com/yzt/egd/        oijfgfdgdf
http://ebay.com/buy/something    fgfgeg

2 Answers


  1. Use Series.str.contains with the following regex pattern:

    pattern = '|'.join(rf'(?://[^/]*?{b}[^/]+)' for b in blacklist)
    m = df['url'].str.contains(pattern, case=False)
    df = df[~m]
    

    # print(df)
                               url        text
    0  http://example.com/abc/def/  fdogjdfgfd
    1    http://abcde.com/yzt/egd/  oijfgfdgdf
    

    You can test the regex here.
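To see what this pattern does outside pandas, here is a minimal sketch using only the standard `re` module. The fourth URL is a hypothetical addition (not from the question) that has a blacklisted word only in its path; the `re.IGNORECASE` flag stands in for `case=False`.

```python
import re

blacklist = ["ebay", "shop", "camping", "car"]

# Same construction as above: one alternative per blacklisted word.
# "//" ties the match to the start of the host, and [^/] cannot cross
# the next "/", so the path after the host is never inspected.
pattern = re.compile(
    "|".join(rf"(?://[^/]*?{b}[^/]+)" for b in blacklist),
    re.IGNORECASE,
)

urls = [
    "http://example.com/abc/def/",
    "http://abcde.com/yzt/egd/",
    "http://ebay.com/buy/something",
    "http://example.com/shop/tents/",  # hypothetical: word appears only in the path
]

hits = [bool(pattern.search(u)) for u in urls]
print(hits)  # [False, False, True, False]
```

Note that this is a substring match on the host, so a host such as carnival.com would also be caught by "car". If a blacklist entry could contain regex metacharacters, wrapping it in `re.escape` would be a sensible precaution.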

  2. You can first extract the part of the URL before the third '/' and then apply your logic to it:

    mask = df["url"].str.extract(r'^((?:[^/]*/){,2}[^/]*)', expand=False).str.contains(r'\b(?:{})\b'.format('|'.join(blacklist)))
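The two-step idea (cut the URL off before the third '/', then apply the word-boundary blacklist) can also be sketched with the standard `re` module alone. The helper name `is_blacklisted` and the last two test URLs are assumptions made for illustration; they are not from the question.

```python
import re

blacklist = ["ebay", "shop", "camping", "car"]

# Everything before the third "/": at most two "segment/" groups
# ("http:/" and "/"), followed by the host itself.
prefix_re = re.compile(r"^(?:[^/]*/){,2}[^/]*")

# Whole-word match against the blacklist, as in the question.
word_re = re.compile(r"\b(?:{})\b".format("|".join(map(re.escape, blacklist))))

def is_blacklisted(url):
    """Return True if the scheme-and-host part of url contains a blacklisted word."""
    prefix = prefix_re.match(url).group(0)  # always matches, possibly empty
    return bool(word_re.search(prefix))

results = {
    u: is_blacklisted(u)
    for u in [
        "http://example.com/abc/def/",
        "http://ebay.com/buy/something",
        "http://example.com/car/rental/",  # hypothetical: word only in the path
        "http://carnival.com/",            # hypothetical: "car" is not a whole word
    ]
}
print(results)
```

Only the ebay.com URL comes out as blacklisted here. With a DataFrame, the same helper could be applied as `mask = df["url"].map(is_blacklisted)` followed by `df_new = df[~mask]`.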
    