
I’m working on an application whose main task is to scrape text from a LinkedIn profile, process that text, and return a dict with the most common words from the profile. Everything worked perfectly on my local machine, but problems appeared when I decided to deploy it on Heroku. It turns out my scraping process takes almost 5-7 minutes, so I hit the Request Timeout on Heroku. To avoid that, I added Celery to my project to run this process in the background. Now I'm having trouble deploying this smoothly on Heroku.

Structure of project:

web-sourcing-tools
├── app
│   ├── agents
│   │   ├── __init__.py
│   │   ├── data_processing.py
│   │   ├── scraper.py
│   │   └── string_builder.py
│   ├── library
│   │   └── helpers.py
│   ├── pages
│   │   ├── __init__.py
│   │   └── home.md
│   └── __init__.py
├── static
│   ├── css
│   │   ├── mystyle.css
│   │   └── style3.css
│   └── images
│       └── favicon.ico
├── templates
│   ├── include
│   │   ├── sidebar.html
│   │   └── topnav.html
│   ├── base.html
│   ├── form.html
│   └── page.html
├── .gitignore
├── __init__.py
├── main.py
├── nltk.txt
├── Procfile
├── README.md
├── requirements.txt
├── runtime.txt
└── tasks.py

Procfile:

web: gunicorn -w 4 -k uvicorn.workers.UvicornWorker main:app
worker: celery worker --app=tasks.app

runtime.txt

tasks.py

from celery import Celery
import os
from app.agents.scraper import Scraper

app = Celery(__name__)
app.conf.update(
    BROKER_URL=os.environ["REDIS_URL"],
    CELERY_RESULT_BACKEND=os.environ["REDIS_URL"]
)


@app.task(name="scraper")
def scraper(username, password, query, n_pages):
    results = Scraper(username, password, query, n_pages)
    return results

main.py

from fastapi import FastAPI, Request, Form
from fastapi.responses import HTMLResponse
from fastapi.templating import Jinja2Templates
from fastapi.staticfiles import StaticFiles
from app.library.helpers import *
from app.agents.string_builder import string_builder
from tasks import scraper


LOGIN = os.environ.get("LOGIN")
PASS = os.environ.get("PASS")

app = FastAPI()
templates = Jinja2Templates(directory="templates")
app.mount("/static", StaticFiles(directory="static"), name="static")


@app.get("/", response_class=HTMLResponse)
async def home(request: Request):
    data = openfile("home.md")
    return templates.TemplateResponse("page.html", {"request": request, "data": data})


@app.post("/common-words")
def form_post(
    request: Request,
    string_or: str = Form(...),
    string_and: str = Form(...),
    string_not: str = Form(...),
):
    query = string_builder(OR=string_or, AND=string_and, NOT=string_not)
    n_page = 2
    task = scraper.delay(LOGIN, PASS, query, n_page)
    return templates.TemplateResponse(
        "form.html", context={"request": request, "result": task.get()}
    )


@app.get("/common-words")
def form_post(request: Request):

    result = ""
    return templates.TemplateResponse(
        "form.html", context={"request": request, "result": result}
    )


if __name__ == "__main__":
    app.run()

Error from the Heroku console:

2022-01-17T23:42:38.383531+00:00 heroku[router]: at=info method=GET path="/common-words" host=web-sourcing-tools.herokuapp.com request_id=21dd948a-b7e5-46f8-8c1c-9b5a3f091592 fwd="95.175.20.47" dyno=web.1 connect=0ms service=7ms status=200 bytes=6691 protocol=https
2022-01-17T23:43:11.505703+00:00 heroku[router]: at=error code=H12 desc="Request timeout" method=POST path="/common-words" host=web-sourcing-tools.herokuapp.com request_id=59465f6f-27d0-4583-83e8-40e6e6e5bd8d fwd="95.175.20.47" dyno=web.1 connect=0ms service=30000ms status=503 bytes=0 protocol=https
2022-01-17T23:43:12.148229+00:00 app[web.1]: 95.175.20.47:0 - "GET /favicon.ico HTTP/1.1" 404
2022-01-17T23:43:12.149208+00:00 heroku[router]: at=info method=GET path="/favicon.ico" host=web-sourcing-tools.herokuapp.com request_id=e079f8a2-a58b-4b3c-8bda-c2d4acd362ef fwd="95.175.20.47" dyno=web.1 connect=0ms service=3ms status=404 bytes=173 protocol=https
2022-01-17T23:44:10.922495+00:00 heroku[router]: at=error code=H12 desc="Request timeout" method=POST path="/common-words" host=web-sourcing-tools.herokuapp.com request_id=2664e65d-30a9-485f-8048-f67515d624a4 fwd="95.175.20.47" dyno=web.1 connect=0ms service=30000ms status=503 bytes=0 protocol=https
2022-01-17T23:44:15.101837+00:00 heroku[router]: at=error code=H12 desc="Request timeout" method=POST path="/common-words" host=web-sourcing-tools.herokuapp.com request_id=7c88d428-e3e5-4b0e-88f9-4769ac229c24 fwd="95.175.20.47" dyno=web.1 connect=0ms service=30000ms status=503 bytes=0 protocol=https
2022-01-17T23:42:56.000000+00:00 app[heroku-redis]: source=REDIS addon=redis-closed-93849 sample#active-connections=5 sample#load-avg-1m=0.16 sample#load-avg-5m=0.205 sample#load-avg-15m=0.215 sample#read-iops=0 sample#write-iops=0 sample#memory-total=15619140kB sample#memory-free=10414152kB sample#memory-cached=2560180kB sample#memory-redis=433568bytes sample#hit-rate=0.21569 sample#evicted-keys=0
2022-01-17T23:46:40.000000+00:00 app[heroku-redis]: source=REDIS addon=redis-closed-93849 sample#active-connections=8 sample#load-avg-1m=0.095 sample#load-avg-5m=0.15 sample#load-avg-15m=0.185 sample#read-iops=0 sample#write-iops=0 sample#memory-total=15619140kB sample#memory-free=10413852kB sample#memory-cached=2560192kB sample#memory-redis=499248bytes sample#hit-rate=0.21053 sample#evicted-keys=0
2022-01-17T23:50:40.000000+00:00 app[heroku-redis]: source=REDIS addon=redis-closed-93849 sample#active-connections=4 sample#load-avg-1m=0.175 sample#load-avg-5m=0.14 sample#load-avg-15m=0.17 sample#read-iops=0 sample#write-iops=0 sample#memory-total=15619140kB sample#memory-free=10414380kB sample#memory-cached=2560276kB sample#memory-redis=415400bytes sample#hit-rate=0.21053 sample#evicted-keys=0
2022-01-17T23:54:36.000000+00:00 app[heroku-redis]: source=REDIS addon=redis-closed-93849 sample#active-connections=4 sample#load-avg-1m=0.09 sample#load-avg-5m=0.1 sample#load-avg-15m=0.145 sample#read-iops=0 sample#write-iops=0 sample#memory-total=15619140kB sample#memory-free=10418696kB sample#memory-cached=2560544kB sample#memory-redis=415400bytes sample#hit-rate=0.21053 sample#evicted-keys=0
2022-01-17T23:58:20.000000+00:00 app[heroku-redis]: source=REDIS addon=redis-closed-93849 sample#active-connections=4 sample#load-avg-1m=0.18 sample#load-avg-5m=0.135 sample#load-avg-15m=0.145 sample#read-iops=0 sample#write-iops=0 sample#memory-total=15619140kB sample#memory-free=10418720kB sample#memory-cached=2560560kB sample#memory-redis=415400bytes sample#hit-rate=0.21053 sample#evicted-keys=0
2022-01-18T00:02:16.000000+00:00 app[heroku-redis]: source=REDIS addon=redis-closed-93849 sample#active-connections=4 sample#load-avg-1m=0.355 sample#load-avg-5m=0.315 sample#load-avg-15m=0.215 sample#read-iops=0 sample#write-iops=0.063241 sample#memory-total=15619140kB sample#memory-free=10421644kB sample#memory-cached=2560532kB sample#memory-redis=415400bytes sample#hit-rate=0.21053 sample#evicted-keys=0

In tasks.py I import my main script with from app.agents.scraper import Scraper. This class returns a dict containing two values: word and quantity.

In Heroku I added my Config Vars like this:
[screenshot of Heroku Config Vars]

Do you have any thoughts on where I made a mistake?

2 Answers


  1. It may be that LinkedIn blocks or rate-limits the IP ranges of cloud hosting services. See this comment: https://github.com/spinlud/linkedin-jobs-scraper/issues/10#issuecomment-692537789

  2. Did you activate the worker dyno from the Resources page in Heroku? In my case, I missed that.
    [image: Heroku celery worker dyno]

    You can also update your Procfile like this (from Celery 5.0 on, the -A/--app option has to come before the worker subcommand):

    web: gunicorn -w 4 -k uvicorn.workers.UvicornWorker main:app
    worker: celery -A tasks worker
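
A further point, separate from the two answers above: the POST handler in main.py calls task.get(), which blocks the web request until the worker has finished. Even with the worker dyno running, a 5-7 minute scrape would therefore still hit the 30-second limit shown in the H12 log lines. A common way around this is to return the task id immediately and let the page poll a second endpoint for the result. What follows is a minimal, stdlib-only sketch of that submit-then-poll pattern, not the asker's code. ThreadPoolExecutor stands in for the Celery worker, the jobs dict stands in for the Redis result backend, and slow_scrape, submit_job and job_status are hypothetical names chosen for illustration.

```python
import time
import uuid
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the Celery worker (assumption: the real work runs on the worker dyno).
executor = ThreadPoolExecutor(max_workers=2)

# Stand-in for the Redis result backend: maps job id -> Future.
jobs = {}


def slow_scrape(query, n_pages):
    """Hypothetical placeholder for the real scraper; sleeps instead of scraping."""
    time.sleep(0.2)
    return {"query": query, "pages": n_pages}


def submit_job(query, n_pages):
    """Enqueue the work and return an id right away.

    In the real app this role would be played by scraper.delay(...).id.
    """
    job_id = str(uuid.uuid4())
    jobs[job_id] = executor.submit(slow_scrape, query, n_pages)
    return job_id


def job_status(job_id):
    """Report the job state without blocking.

    In the real app this role would be played by looking up the
    task's AsyncResult and checking ready() before reading the result.
    """
    future = jobs.get(job_id)
    if future is None:
        return {"state": "UNKNOWN", "result": None}
    if not future.done():
        return {"state": "PENDING", "result": None}
    if future.exception() is not None:
        return {"state": "FAILURE", "result": None}
    return {"state": "SUCCESS", "result": future.result()}


if __name__ == "__main__":
    job_id = submit_job("python AND fastapi", 2)
    # The "POST" returns here; a later "GET" polls until the job is done.
    while job_status(job_id)["state"] == "PENDING":
        time.sleep(0.05)
    print(job_status(job_id))
```

With this shape, the POST /common-words handler would only enqueue the task and render the page with the id. A separate GET route (for example one taking the task id as a path parameter, which is an assumption about how the app could be laid out) would return the status, so no single request waits longer than a few milliseconds.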
    