
I am writing a Scrapy Python project that scrapes data for a future ML project. I decided to containerize the project with Docker – below is my Dockerfile:

FROM python:3.9.12-slim-buster

WORKDIR /app

RUN apt-get update && apt-get install -y git

RUN pip3 install --upgrade pip

COPY requirements.txt requirements.txt

RUN pip install -r requirements.txt

ADD . /app

I am able to run the following command and my scraper will run successfully:

docker run -it  ufc-stats-scraper scrapy crawl ufc_future_fights -o future.csv -t csv

output:

....
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 53,
 'scheduler/dequeued/memory': 53,
 'scheduler/enqueued': 53,
 'scheduler/enqueued/memory': 53,
 'start_time': datetime.datetime(2022, 4, 20, 2, 4, 7, 365309)}
2022-04-20 02:04:08 [scrapy.core.engine] INFO: Spider closed (finished)

However, the scraped data is stored in the future.csv file, which is local to the container. I read online that I should use the -v flag to mount a folder from the host into the container. Below is the command that I am trying to use:

 docker run -it -v ${PWD}:/app ufc-stats-scraper scrapy crawl ufc_future_fights -o future.csv -t csv

However, I am getting the error message below when running this command:

Scrapy 2.6.1 - no active project

Unknown command: crawl

Use "scrapy" to see available command

I’m fairly new to Docker but was wondering if anyone had any input on what I could be doing wrong here. Thanks!

Update:

I started a bash session in two versions of my Docker image: one unmounted and the other mounted. In the unmounted session, the app folder contained all of the repo's files. The weird thing is that the app folder is completely empty in the mounted session, which would explain why the error message reads "no active project". I am really confused about why the mounted image is empty, though.

I feel like I might be misunderstanding how Docker bind mounts work.
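The comparison above can be reproduced like this (a sketch; it assumes bash is available in the image, which it is for a slim-buster base):

```shell
# Unmounted: /app contains the files baked into the image by `ADD . /app`
docker run -it --rm ufc-stats-scraper bash -c "ls /app"

# Mounted: the bind mount replaces /app with the host's ${PWD}, so if
# ${PWD} is empty (or not the project directory), /app appears empty
docker run -it --rm -v ${PWD}:/app ufc-stats-scraper bash -c "ls /app"
```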

2 Answers


  1. Chosen as BEST ANSWER

    The issue is that I was mounting over the /app directory, which hid all of that directory's files. Instead of mounting the app directory itself, I created a new data folder within the app directory and mounted only that.
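    In concrete terms, that fix could look like this (a sketch; the `data/` folder name and output path are illustrative, not from the original project):

    ```shell
    # Mount only a data/ subfolder so the project files baked into /app stay visible,
    # while the scraped output lands on the host
    mkdir -p data
    docker run -it --rm -v ${PWD}/data:/app/data ufc-stats-scraper \
        scrapy crawl ufc_future_fights -o data/future.csv -t csv
    ```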


  2. Hope you are enjoying your containers journey!

    Since I have no further information to reproduce your situation exactly, I will just propose what could work for you:

    Here you got an error saying "Unknown command: crawl"; it means that the docker binary is interpreting the "crawl" argument of your scrapy command as a standalone command.

    To avoid this, instead of running:

     docker run -it -v ${PWD}:/app ufc-stats-scraper scrapy crawl ufc_future_fights -o future.csv -t csv
    

    you should run your scrapy command within "bash -c", like this:

      docker run -it -v ${PWD}:/app ufc-stats-scraper  bash -c "scrapy crawl ufc_future_fights -o future.csv -t csv"
    

    For your information, you can add the VOLUME configuration and your entrypoint command directly into your Dockerfile; this way, you should be able to run your container with just docker run -it (or -d) ufc-stats-scraper (cf. https://kapeli.com/cheat_sheets/Dockerfile.docset/Contents/Resources/Documents/index)
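    As a sketch, the end of such a Dockerfile could look like this (the VOLUME path and the ENTRYPOINT/CMD split are assumptions, not taken from your project):

    ```dockerfile
    # Declare a mount point for scraped output and bake in the crawl command;
    # CMD arguments can still be overridden at `docker run` time
    VOLUME ["/app/data"]
    ENTRYPOINT ["scrapy"]
    CMD ["crawl", "ufc_future_fights", "-o", "data/future.csv", "-t", "csv"]
    ```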

    Hope this will help you !
    bguess
