I’ve got a basic Django project. One feature I am working on counts the most commonly occurring words in a .txt file, such as a large public domain book. I’ve used the Python Natural Language Toolkit (NLTK) to filter out “stopwords” (in SEO language, that means redundant words such as ‘the’, ‘you’, etc.).
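For context, the word-counting feature described above can be sketched roughly as follows. This is a minimal sketch, not the project's actual code: the function name `most_common_words` and the tiny hand-written stopword set are assumptions for illustration. In the real app the stopword list would come from NLTK's stopwords corpus (`nltk.corpus.stopwords.words('english')`), which is the resource that goes missing on Heroku.

```python
import re
from collections import Counter

# Hypothetical stand-in for NLTK's English stopword list, so the sketch
# runs without any downloaded corpus.
SAMPLE_STOPWORDS = {"the", "you", "a", "and", "of", "to", "in", "is"}


def most_common_words(text, stopwords=SAMPLE_STOPWORDS, n=10):
    """Return the n most frequent non-stopword words in text."""
    # Lowercase the text and split it into simple word tokens.
    words = re.findall(r"[a-z']+", text.lower())
    # Count only the words that are not in the stopword set.
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common(n)


if __name__ == "__main__":
    print(most_common_words("the cat the cat the dog", n=2))
```

Passing the stopword set as a parameter keeps the counting logic independent of where the list comes from, so swapping in the NLTK list is a one-argument change.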
Anyways, I’m getting this debug traceback when Django serves the template:
Resource stopwords not found. Please use the NLTK Downloader to
obtain the resource:
>>> import nltk
>>> nltk.download('stopwords')
For more information see: https://www.nltk.org/data.html
So I need to download the stopwords corpus. To resolve the issue, I simply open a Python REPL on my remote server and run these two lines:
>>> import nltk
>>> nltk.download('stopwords')
That’s covered at length elsewhere on SO. That resolves the issue, but only temporarily. As soon as the REPL session is terminated on my remote server, the error returns because the stopwords file just evaporates.
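A likely reason the fix does not stick is Heroku's ephemeral filesystem: a download made in a one-off REPL session lands on that session's own disk, which is discarded when the session ends, so the web process never sees it. One way to confirm this from inside the app is to check whether the corpus is on NLTK's data path. The sketch below relies on `nltk.data.find`, which raises `LookupError` when a resource is missing; the helper name `stopwords_available` and the injectable `find` parameter are assumptions added so the check can be exercised without NLTK installed.

```python
def stopwords_available(find=None):
    """Return True if the stopwords corpus can be located, else False."""
    if find is None:
        # Imported lazily so the helper can be tested without NLTK installed.
        from nltk.data import find as nltk_find
        find = nltk_find
    try:
        # NLTK resources are addressed as "<category>/<package id>".
        find("corpora/stopwords")
    except LookupError:
        return False
    return True
```

Calling `stopwords_available()` with no arguments from a Django view or management command would report whether the running process can actually see the corpus.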
I noticed something strange when I use git to push my changes up to my remote server on Heroku. Check this:
remote: -----> Python app detected
remote: -----> No change in requirements detected, installing from cache
remote: -----> Installing pip 20.1.1, setuptools 47.1.1 and wheel 0.34.2
remote: -----> Installing SQLite3
remote: -----> Installing requirements with pip
remote: -----> Downloading NLTK corpora…
remote: ! 'nltk.txt' not found, not downloading any corpora
remote: ! Learn more: https://devcenter.heroku.com/articles/python-nltk
remote: -----> $ python manage.py collectstatic --noinput
remote: 122 static files copied to '/tmp/build_f2f9d10f/staticfiles', 388 post-processed.
That Dev Center article is little more than a stub. It says that to use NLTK on Heroku, you need to add an nltk.txt file to the project directory which lists the resources for Heroku to download. So I went ahead and created an nltk.txt file which contained:
corpora
That is the nltk.txt currently located in my project directory. In addition to corpora, I also tried adding various combinations of the following three entries to nltk.txt:
corpus
stoplist
english
I tried adding all four, just two, and just one. For example, here is an alternate nltk.txt that I tried verbatim. My feeling is that the main one I really need is just corpora, so that is the only entry in the nltk.txt that I am working with right now. With corpora there, when I push the change and Heroku builds the environment, I see this error and traceback:
remote: -----> Downloading NLTK corpora…
remote: -----> Downloading NLTK packages: corpora english stopwords corpus
remote: /app/.heroku/python/lib/python3.6/runpy.py:125: RuntimeWarning: 'nltk.downloader' found in sys.modules after import of package 'nltk', but prior to execution of 'nltk.downloader'; this may result in unpredictable behaviour
remote: warn(RuntimeWarning(msg))
remote: [nltk_data] Error loading corpora: Package 'corpora' not found in
remote: [nltk_data] index
remote: Error installing package. Retry? [n/y/e]
remote: Traceback (most recent call last):
remote: File "/app/.heroku/python/lib/python3.6/runpy.py", line 193, in _run_module_as_main
remote: "__main__", mod_spec)
remote: File "/app/.heroku/python/lib/python3.6/runpy.py", line 85, in _run_code
remote: exec(code, run_globals)
remote: File "/app/.heroku/python/lib/python3.6/site-packages/nltk/downloader.py", line 2538, in <module>
remote: halt_on_error=options.halt_on_error,
remote: File "/app/.heroku/python/lib/python3.6/site-packages/nltk/downloader.py", line 790, in download
I am clearly not using nltk.txt properly because it isn’t finding the corpora package. I can install nltk and have it run without issue on my local dev server, but my remaining question is this: how do I make Heroku handle nltk properly remotely in this situation?
User Michael Godshall provides the same answer to more than one Stack Overflow question, explaining that you can create a bin directory within the project root and add both a post_compile bash script and an install_nltk_data script. However, this is no longer necessary because heroku-buildpack-python upstream maintainer Kenneth Reitz implemented an easy solution. All that is required now is to add an nltk.txt which contains the library you need. But I did that and I am still getting the error above.
The official nltk website documents how to use the library in general and how to install it, which isn’t helpful in the case of Heroku because Heroku seems to handle nltk differently.
2 Answers
Eureka! I got it working. My problem was with the name of the nltk library download. I tried stoplist when the actual name is stopwords. Ha! The content of my nltk.txt is now simply: stopwords. When I pushed to Heroku, the build succeeded and my website is now deployed and accessible on the web.
Special thanks go out to @Darkknight for his patience and insight in the comment section of his answer.
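The mistake generalizes: each line of nltk.txt has to be an identifier that the NLTK downloader recognizes, not a category name like corpora. A minimal sketch of a pre-push sanity check follows. The helper name `unknown_nltk_entries` and the short `KNOWN_IDS` set are assumptions for illustration; the set lists only a handful of identifiers and is nowhere near the full NLTK index.

```python
# Hypothetical, deliberately incomplete set of NLTK package identifiers.
KNOWN_IDS = {"stopwords", "punkt", "wordnet", "averaged_perceptron_tagger"}


def unknown_nltk_entries(nltk_txt, known_ids=KNOWN_IDS):
    """Return the entries in nltk.txt content that are not in known_ids."""
    # One entry per line; blank lines and surrounding whitespace are ignored.
    entries = [line.strip() for line in nltk_txt.splitlines() if line.strip()]
    return [entry for entry in entries if entry not in known_ids]


if __name__ == "__main__":
    # The entries tried in the question, one per line.
    print(unknown_nltk_entries("corpora\ncorpus\nstoplist\nenglish\n"))
    # The working file.
    print(unknown_nltk_entries("stopwords\n"))
```

An empty result only means every entry is in the hand-written set; an entry flagged here may still be a valid identifier that the set does not include.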
Yes, you need the nltk.txt file, set up properly, similar to the requirements.txt file. Refer to the official doc here. If you are still facing the same situation, post the nltk.txt file here; that will give us some way to find the solution. Maybe this will also help you.