I have a DataFrame with URLs and a blacklist of words that I use to filter these URLs.
Now I want to apply the filter only to the part of each URL up to the third occurrence of /.
So for example:
Here I would like to filter only up to the third occurrence of /.
So just:
http://example.com/
I read some similar questions and I guess I need to combine two regexes.
- /.*?/(.*?)/ should do the job of matching up to the third occurrence of /
- To filter for a list of words I use this expression:
mask = df["url"].str.contains(r'\b(?:{})\b'.format('|'.join(blacklist)))
df_new = df[~mask]
Now I don’t know how to combine these two expressions. I’m new to Python and especially regex so there also might be a smarter way of doing this task.
Thank you.
EDIT:
Blacklist looks like this: ["ebay","shop","camping","car"]
Df like this:
url                              text
http://example.com/abc/def/      fdogjdfgfd
http://abcde.com/yzt/egd/        oijfgfdgdf
http://ebay.com/buy/something    fgfgeg
2 Answers
Use Series.str.contains with the given regex pattern:
You can test the regex here.

You can first extract the part of the URL up to the third '/' and then use your logic on this:
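The code for this approach is not shown here, so the following is only a minimal sketch of how the extract-then-filter idea could look, not necessarily what the answer originally posted. The function name filter_by_url_prefix, the extraction pattern, and the fallback to the full URL when fewer than three slashes are present are assumptions made for illustration; the sample data and the blacklist are taken from the question.

```python
import re

import pandas as pd


def filter_by_url_prefix(df, blacklist, column="url"):
    """Drop rows whose URL prefix (up to the third "/") contains a blacklisted word.

    Hypothetical helper written for illustration only.
    """
    # Capture everything up to and including the third "/",
    # e.g. "http://example.com/abc/def/" -> "http://example.com/".
    prefix = df[column].str.extract(r"^((?:[^/]*/){3})", expand=False)

    # Assumption: a URL with fewer than three slashes yields NaN,
    # so fall back to checking the whole URL in that case.
    prefix = prefix.fillna(df[column])

    # Same word-boundary pattern as in the question; re.escape guards
    # against blacklist entries that contain regex metacharacters.
    pattern = r"\b(?:{})\b".format("|".join(map(re.escape, blacklist)))
    mask = prefix.str.contains(pattern, na=False)
    return df[~mask]


# Sample data copied from the question's EDIT.
df = pd.DataFrame({
    "url": [
        "http://example.com/abc/def/",
        "http://abcde.com/yzt/egd/",
        "http://ebay.com/buy/something",
    ],
    "text": ["fdogjdfgfd", "oijfgfdgdf", "fgfgeg"],
})
blacklist = ["ebay", "shop", "camping", "car"]

df_new = filter_by_url_prefix(df, blacklist)
print(df_new)
```

With this sketch, a blacklisted word that appears only after the third slash (for example in http://example.com/car/) does not cause the row to be dropped, because only the extracted prefix is checked.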