
When testing my Lambda function, I always just get this error, which obviously does not help debugging:

    OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k
    START RequestId: a22379c9-427c-45ab-bfd5-c166bf507418 Version: $LATEST
    2023-04-29T15:45:42.496Z a22379c9-427c-45ab-bfd5-c166bf507418 Task timed out after 600.11 seconds
    END RequestId: a22379c9-427c-45ab-bfd5-c166bf507418
    REPORT RequestId: a22379c9-427c-45ab-bfd5-c166bf507418  Duration: 600108.83 ms  Billed Duration: 601489 ms  Memory Size: 256 MB  Max Memory Used: 154 MB  Init Duration: 1379.93 ms

The function runs a Docker container based on an image in ECR.

My lambda function handler looks like this:

import pandas as pd
import requests
from bs4 import BeautifulSoup
import math
from time import sleep
from datetime import date
from requests.adapters import HTTPAdapter
import boto3
from io import BytesIO

COLUMNS = [<<list of columns for the pandas df>> ]

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "https://www.bbc.com/news/entertainment-arts-64759120",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Connection": "keep-alive",
    "Accept-Language": "en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6"}

KEYWORDS  = [ <<list of keywords>>]

def make_get_request(max_retries, URL):

    session = requests.Session()
    retry = HTTPAdapter(max_retries=max_retries)
    session.mount('https://', retry)
    session.mount('http://', retry)
    
    try:
        response = session.get(URL, headers=HEADERS)
        return response
    
    except (ConnectionError, requests.exceptions.Timeout) as err:
        print(f"Failed to connect to the API, retrying... Error: {err}")
        return make_get_request(max_retries, URL)
    except requests.exceptions.TooManyRedirects as err:
        print("Bad URL, try a different one")
        raise SystemExit(err)
    except requests.exceptions.HTTPError as err:
        raise SystemExit(err)


def get_pages()->int:

    BASE_URL="https://www.stepstone.de/jobs/data-engineer?action=facet_selected%3bage%3bage_1&ag=age_1"

    response = make_get_request(3, BASE_URL)
    soup = BeautifulSoup(response.text, 'html.parser') 

    result_tag = soup.find('span', 'res-kyg8or at-facet-header-total-results')
    results = result_tag.text
    results = results.replace(".", "")
    pages = math.ceil(int(results)/25)

    return pages 


def extract_job_cards(title, days, i):

    URL = f"https://www.stepstone.de/jobs/{title}?page_selected={i}&action=facet_selected%3bage%3bage_{days}&ag=age_{days}"
    response = make_get_request(3, URL)
    soup = BeautifulSoup(response.text, 'html.parser')  
    cards = soup.find_all('article', 'res-iibro8')

    return cards 


def request_job_description(href):
        DESCRIPTION_LINK = "https://www.stepstone.de" + href
        response_details = make_get_request(3, DESCRIPTION_LINK)

        return response_details


def extract_keywords(card):

    atag = card.h2.a

    title_card = atag.div
    title_card_str = title_card.text
    title = title_card_str.split("</div></div></div>")[0]
    print(title)
    company_name = card.div.span.text
    href = atag["href"]
    occured_keywords = []


    response_soup = BeautifulSoup(request_job_description(href).text, 'html.parser')
    boxes = response_soup.find_all('div', 'listing-content-provider-10ltcrf')
    if len(boxes)==4:
        del boxes[0]
        del boxes[-1]
    
    for box in boxes:
        text = box.find('span', 'listing-content-provider-pz58b2').get_text()
        for keyword in KEYWORDS:
            if keyword.upper() in text.upper():
                occured_keywords.append(keyword.upper())

        occured_keywords = list(dict.fromkeys(occured_keywords))

    return occured_keywords, title, href, company_name

def append_to_df(occured_keywords, job_df, title, company_name, href):

    job_dict = {
        "LOADED_AT": date.today(),
        "JOB_TITLE": title,
        "COMPANY": company_name,
        "HREF_LINK": href,
        'PYTHON': 0
# There are obviously more key-value pairs, but I left them out of this post for simplicity reasons
    }

    for skill in occured_keywords:
        job_dict[skill] = 1

    row = []
    for value in job_dict.values():
        row.append(value)

    job_df.loc[len(job_df)] = row


def extract_and_append_skills(cards, job_df):

    for card in cards:
        if all(job not in card.h2.a.div.text for job in ["Data Engineer", "DWH", "Data Warehouse", "ETL", "Analytics Engineer", "Business Intelligence", "Data Platform", "Data Architekt", "Data Architect"]):
            continue
        else:
            keywords, title, href, company_name = extract_keywords(card)
            append_to_df(keywords, job_df, title, company_name, href)


    def main(event, context):    
    
        job_df = pd.DataFrame(columns=COLUMNS)
        try:
            for i in range(get_pages()):
                cards = extract_job_cards('data-engineer', 1, i)
                extract_and_append_skills(cards, job_df)
            job_df = job_df[~job_df.duplicated(subset=['HREF_LINK'])].copy()
            print(len(job_df))

            return "Success"
        except Exception as e:
            print(e)

The Dockerfile looks like this:

FROM public.ecr.aws/lambda/python:3.9
  
COPY stepstone_scraper.py ${LAMBDA_TASK_ROOT}
COPY requirements.txt ./

RUN pip install -r requirements.txt -t "${LAMBDA_TASK_ROOT}"

RUN chmod 644 $(find . -type f)
RUN chmod 755 $(find . -type d)

CMD ["stepstone_scraper.main"]

The Lambda function is created in Terraform like this:

resource "aws_lambda_function" "job_scraping_function" {
package_type  = "Image"
image_uri     = "${aws_ecr_repository.scraping_repo.repository_url}:latest"
function_name = "job_scraping_function"
role = aws_iam_role.lambda_s3_role.arn
memory_size  = 256
timeout      = 600
depends_on = [null_resource.docker_build_and_push]
}

The underlying role can be assumed by Lambda, has full S3, EC2, Lambda, ECR, and CloudWatch access, and has the AWSLambdaBasicExecutionRole managed policy attached to it.

Has anyone got an idea what my issue might be?

2 Answers


  1. Your def main(event, context): is incorrectly indented: it is nested inside extract_and_append_skills, so Lambda most likely cannot find and invoke the main handler.
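
    A minimal sketch of what that fix looks like, assuming the handler stays in stepstone_scraper.py as the Dockerfile's CMD expects: define main at module level (zero indentation), not inside extract_and_append_skills, so the runtime can resolve stepstone_scraper.main.

        # stepstone_scraper.py -- handler defined at module level, not nested in another function
        def main(event, context):
            job_df = pd.DataFrame(columns=COLUMNS)
            try:
                for i in range(get_pages()):
                    cards = extract_job_cards('data-engineer', 1, i)
                    extract_and_append_skills(cards, job_df)
                # Drop duplicate job postings by their link before reporting the count
                job_df = job_df[~job_df.duplicated(subset=['HREF_LINK'])].copy()
                print(len(job_df))
                return "Success"
            except Exception as e:
                print(e)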

  2. You are hardly logging anything, so I would expect exactly the logs you are seeing. You need to add more logging to see what is happening before the timeout; this is not an issue of "getting proper logs", it is an issue of your Lambda function timing out.
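
    As a rough sketch (the placement and wording of the log lines are just an illustration), the standard logging module can be used inside the handler so each step shows up in CloudWatch before the timeout hits:

        import logging

        logger = logging.getLogger()
        logger.setLevel(logging.INFO)

        def main(event, context):
            job_df = pd.DataFrame(columns=COLUMNS)
            pages = get_pages()
            logger.info("Scraping %s result pages", pages)
            for i in range(pages):
                logger.info("Fetching page %s", i)
                cards = extract_job_cards('data-engineer', 1, i)
                logger.info("Page %s returned %s job cards", i, len(cards))
                extract_and_append_skills(cards, job_df)
            return "Success"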

    The reason it is timing out is most likely that the network requests you are making do not work from the Lambda environment. The most likely cause is that you have configured the Lambda function to run in a VPC without deploying it to a subnet with a route to a NAT Gateway. If that is not the case, then the site you are trying to access on the Internet may have blocked AWS IP ranges.
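
    One cheap way to confirm the network theory is to give every request an explicit timeout; requests has no timeout by default, so a dead connection can hang until Lambda kills the function at 600 seconds. A sketch based on the make_get_request helper from the question (the 10-second value is an arbitrary choice):

        def make_get_request(max_retries, URL):
            session = requests.Session()
            adapter = HTTPAdapter(max_retries=max_retries)
            session.mount('https://', adapter)
            session.mount('http://', adapter)
            try:
                # Fail fast instead of hanging for the whole 600 s Lambda timeout
                return session.get(URL, headers=HEADERS, timeout=10)
            except requests.exceptions.RequestException as err:
                # A connect timeout here strongly suggests no outbound route (e.g. VPC without NAT)
                print(f"Request to {URL} failed: {err}")
                raise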
