I’m trying to rank the most frequently used words in a large text file, Alice in Wonderland (which is public domain). Here is Alice in Wonderland on Dropbox and on Pastebin. The script runs and, as expected, reports 1818 instances of "the" and 940 instances of "and".
But now, in my latest iteration of the script, I’m trying to filter out the most commonly used words such as "and", "there", "the", "that", "to", "a", etc. Any search algorithm out there identifies words like these (called stop words in SEO terminology) and excludes them from the query. The Python library I’ve imported for this task is nltk.corpus.
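For reference, this is how I pull in the stop word list (a quick sketch; the corpus needs a one-time nltk.download('stopwords') if it isn't already installed):

from nltk.corpus import stopwords
# One-time download if the corpus is missing:
# import nltk; nltk.download('stopwords')
stoplist = stopwords.words('english')
print(stoplist[:10])  # e.g. ['i', 'me', 'my', 'myself', 'we', ...]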
When I generate a stop words list and invoke the filter, all the instances of “the” and “of” are filtered out as expected but it’s not catching “and” or “you”. It’s not clear to me why.
I’ve tried reinforcing the stop words list by manually and explicitly adding words which appear in the output but shouldn’t be there. I’ve added "said", "you", "that", and others, yet they still appear among the top 10 most common words in the text file.
Here is my script:
from collections import Counter
from nltk.corpus import stopwords
import re


def open_file():
    with open('Alice.txt') as f:
        text = f.read().lower()
    return text


def main(text):
    stoplist = stopwords.words('english')  # Bring in the default English NLTK stop words
    stoplist.extend(["said", "i", "it", "you", "and", "that"])
    # print(stoplist)
    clean = [word for word in text.split() if word not in stoplist]
    clean_text = ' '.join(clean)
    words = re.findall(r'\w+', clean_text)
    top_10 = Counter(words).most_common(10)
    for word, count in top_10:
        print(f'{word!r:<4} {"-->":^4} {count:>4}')


if __name__ == "__main__":
    text = open_file()
    main(text)
Here is my actual output:
$ python script8.py
'alice'  -->   403
'i'  -->   283
'it'  -->   205
's'  -->   184
'little'  -->   128
'you'  -->   115
'and'  -->   107
'one'  -->   106
'gutenberg'  -->    93
'that'  -->    92
What I am expecting is for all the instances of "i", "it" and "you" to be excluded from this list but they are still appearing and it is not clear to me why.
2 Answers
For example:

>>> "it's".split()
["it's"]
>>> re.findall(r'\w+', "it's")
['it', 's']

That is why filtering against "stoplist" won’t work the way you think: split() keeps "it's" as a single token, and only the later regex breaks it into "it" and "s".

Fix: extract the word tokens with the regex first, and only then filter them against the stop list.
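A minimal sketch of that idea, assuming 'Alice.txt' and the same stop list as in the question:

import re
from collections import Counter
from nltk.corpus import stopwords

text = open('Alice.txt').read().lower()          # same input file as the question
stoplist = stopwords.words('english')
stoplist.extend(["said", "i", "it", "you", "and", "that"])

words = re.findall(r'\w+', text)                 # tokenize first: "it's" -> ['it', 's']
clean = [w for w in words if w not in stoplist]  # plain tokens like 'i' and 'it' now hit the stoplist
print(Counter(clean).most_common(10))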
Note: with that change, "i", "it" and "you" are excluded from your list.
Your code does this:

First you split the text on whitespace using text.split(). But the resulting list of ‘words’ still includes punctuation, like "as,", "head!'" and "'i" (note that ' is used as a quotation mark as well as an apostrophe).

Then you exclude any ‘words’ that have a match in stopwords. This will exclude "i" but not "'i".

Next you re-join all the remaining words using spaces.

Then you use a \w+ regex to search for sequences of letters (NOT including punctuation): so "'i" will match as "i". That’s why "i" and "s" are showing up in your top 10.
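A quick interactive check of that behaviour, assuming the same stop list as in the question:

>>> from nltk.corpus import stopwords
>>> import re
>>> stoplist = stopwords.words('english')
>>> 'i' in stoplist
True
>>> "'i" in stoplist          # the token that text.split() actually produces
False
>>> re.findall(r'\w+', "'i")  # the later regex strips the quote again
['i']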
There are a couple of ways to fix this. For example, you can use re.split() to split on more than just whitespace, as in the sketch below.
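A minimal sketch of that approach, assuming 'Alice.txt' and the same stop list as in the question:

import re
from collections import Counter
from nltk.corpus import stopwords

text = open('Alice.txt').read().lower()          # same input file as the question
stoplist = stopwords.words('english')
stoplist.extend(["said", "i", "it", "you", "and", "that"])

# Split on any run of non-word characters, so "head!'" -> "head" and "'i" -> "i"
# before the stop-word check; the `if t` drops empty strings at the edges.
tokens = [t for t in re.split(r'\W+', text) if t]
clean = [word for word in tokens if word not in stoplist]

top_10 = Counter(clean).most_common(10)
for word, count in top_10:
    print(f'{word!r:<4} {"-->":^4} {count:>4}')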
Note that this is treating hyphenated phrases separately: so "gutenberg-tm" -> "gutenberg", "tm". For more control over this, you could follow Jay’s suggestion and look at nltk.tokenize. For example, the nltk tokenizer is aware of contractions, so "don't" -> "do" + "n't".
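A small illustration, assuming the tokenizer models have been fetched with nltk.download('punkt'):

>>> from nltk.tokenize import word_tokenize
>>> word_tokenize("don't")   # contraction-aware, unlike the \w+ regex
['do', "n't"]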
You could also improve things by removing the Gutenberg licensing conditions from your text 🙂