
I’m trying to rank the most frequently used words in a large text file, Alice in Wonderland (which is public domain). Here is Alice in Wonderland on Dropbox and on Pastebin. The script runs, and as expected it finds 1818 instances of “the” and 940 instances of “and”.

But now, in my latest iteration of the script, I’m trying to filter out the most commonly used words such as “and”, “there”, “the”, “that”, “to”, “a”, etc. Search algorithms typically detect words like these (called stop words in SEO terminology) and exclude them from the query. The Python library I’ve imported for this task is nltk.corpus.
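For reference, the default NLTK English list already contains almost all of the words above; a quick check (assuming the stopwords corpus has been downloaded once with nltk.download('stopwords')) looks like this:

from nltk.corpus import stopwords

stoplist = stopwords.words('english')
# the entries are plain lowercase tokens such as 'i', 'you', 'and', 'that'
print('and' in stoplist, 'you' in stoplist, 'that' in stoplist, 'i' in stoplist)  # True True True True
print(len(stoplist))  # roughly 180 entries in recent NLTK releases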

When I generate a stop words list and invoke the filter, all the instances of “the” and “of” are filtered out as expected but it’s not catching “and” or “you”. It’s not clear to me why.

I’ve tried reinforcing the stop words list by manually and explicitly adding words which appear in the output that shouldn’t be there. I’ve added “said”, “you”, “that”, and others, yet they still appear among the top 10 most common words in the text file.

Here is my script:

from collections import Counter
from nltk.corpus import stopwords
import re

def open_file():
   with open('Alice.txt') as f:
       text = f.read().lower()
   return text

def main(text):
   stoplist = stopwords.words('english') # Bring in the default English NLTK stop words
   stoplist.extend(["said", "i", "it", "you", "and","that",])
   # print(stoplist)
   clean = [word for word in text.split() if word not in stoplist]
   clean_text = ' '.join(clean)
   words = re.findall(r'\w+', clean_text)
   top_10 = Counter(words).most_common(10)
   for word,count in top_10:
       print(f'{word!r:<4} {"-->":^4} {count:>4}')

if __name__ == "__main__":
   text = open_file()
   main(text)

Here is my actual output:

$ python script8.py

'alice' -->   403
'i'  -->   283
'it' -->   205
's'  -->   184
'little' -->   128
'you' -->   115
'and' -->   107
'one' -->   106
'gutenberg' -->    93
'that' -->    92

I am expecting all the instances of "i", "it" and "you" to be excluded from this list, but they are still appearing, and it is not clear to me why.

2 Answers


  1. for example:

    "it's".split() >> [it’s]

    re.findall(r'\w+', "it's") >> ['it', 's']

    that is why filtering against “stoplist” won’t behave the way you think.
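    A small made-up example (not a line from the book) shows the same thing on a whole phrase: tokens with punctuation still attached slip past the stoplist, and re.findall() only strips the punctuation afterwards, which is why words like "and" come back:

    import re
    from nltk.corpus import stopwords

    stoplist = stopwords.words('english')
    sample = '"and then," said the cat, "it\'s late."'
    kept = [w for w in sample.split() if w not in stoplist]
    print(kept)
    # -> ['"and', 'then,"', 'said', 'cat,', '"it\'s', 'late."']  ("and" and "then" survive)
    print(re.findall(r'\w+', ' '.join(kept)))
    # -> ['and', 'then', 'said', 'cat', 'it', 's', 'late']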

    fix:

    def main(text):
        words = re.findall(r'\w+', text)
        counter = Counter(words)
        stoplist = stopwords.words('english')
        #stoplist.extend(["said", "i", "it", "you", "and", "that", ])
        stoplist.extend(["said", "i", "it", "you"])
        for keep_word in ['s', 'and', 'that']:  # deliberately keep these so their true counts show
            stoplist.remove(keep_word)
        for stop_word in stoplist:
            del counter[stop_word]
        for word, count in counter.most_common(10):
            print(f'{word!r:<4} {"-->":^4} {count:>4}')
    

    output

    'and' -->   940
    'alice' -->   403
    'that' -->   330
    's'  -->   219
    'little' -->   128
    'one' -->   106
    'gutenberg' -->    93
    'know' -->    88
    'project' -->    86
    'like' -->    85
    

    note: "i", "it" and "you" to be excluded from your list

  2. Your code does this:

    1. First you split the text on whitespace using text.split(). But the resulting list of ‘words’ still has punctuation attached, producing tokens such as as, (trailing comma included), head!' and 'i (note that ' is used as a quotation mark as well as an apostrophe).

    2. Then you exclude any ‘words’ that have a match in stopwords. This will exclude i but not 'i.

    3. Next you re-join all the remaining words using spaces.

    4. Then you use an r'\w+' regex to search for sequences of word characters (NOT including punctuation): so 'i will match as i. That’s why i and s are showing up in your top 10 (see the trace below).
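    Tracing those four steps on one quoted token makes it concrete (an illustrative snippet, not part of the original script):

    import re
    from nltk.corpus import stopwords

    stoplist = stopwords.words('english')
    tokens = "said the cat, 'i suppose".split()     # ['said', 'the', 'cat,', "'i", 'suppose']
    kept = [w for w in tokens if w not in stoplist]  # "'i" survives; a bare 'i' would not
    joined = ' '.join(kept)                          # "said cat, 'i suppose"
    print(re.findall(r'\w+', joined))                # ['said', 'cat', 'i', 'suppose']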

    There are a couple ways to fix this. For example, you can use re.split() to split on more than just whitespace:

    def main(text):
       stoplist = stopwords.words('english')
       stoplist.extend(["said"]) # stoplist already includes "i", "it", "you"
       clean = [word for word in re.split(r"\W+", text) if word not in stoplist]
       top_10 = Counter(clean).most_common(10)
       for word,count in top_10:
           print(f'{word!r:<4} {"-->":^4} {count:>4}')
    

    Output:

    'alice' -->   403
    'little' -->   128
    'one' -->   106
    'gutenberg' -->    93
    'know' -->    88
    'project' -->    87
    'like' -->    85
    'would' -->    83
    'went' -->    83
    'could' -->    78
    

    Note that this treats hyphenated phrases as separate words: so gutenberg-tm -> gutenberg, tm. For more control over this, you could follow Jay’s suggestion and look at nltk.tokenize. For example, the nltk tokenizer is aware of contractions, so don't -> do + n't.
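    A minimal sketch of that (it assumes the tokenizer model has been fetched, e.g. with nltk.download('punkt'); newer NLTK versions may ask for 'punkt_tab'):

    from nltk.tokenize import word_tokenize

    print(word_tokenize("don't go near the gutenberg-tm license"))
    # -> ['do', "n't", 'go', 'near', 'the', 'gutenberg-tm', 'license']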

    You could also improve things by removing the Gutenberg Licensing conditions from your text 🙂
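    A rough way to do that is to slice the text between the Project Gutenberg start and end markers. The marker wording varies between files, so the strings below are assumptions to check against your copy of Alice.txt (they are lowercase because open_file() lower-cases the text):

    def strip_gutenberg_boilerplate(text):
        # keep only the body between the "*** start of ..." and "*** end of ..." lines;
        # if the markers are not found, return the text unchanged
        start = text.find('*** start of')
        end = text.find('*** end of')
        if start == -1 or end == -1:
            return text
        return text[text.index('\n', start) + 1:end]

    Calling this on the result of open_file() before counting should keep the licence terms from inflating words like gutenberg and project.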
