When testing my Lambda function, I always just get this error, which obviously does not help with debugging:
OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k
START RequestId: a22379c9-427c-45ab-bfd5-c166bf507418 Version: $LATEST
2023-04-29T15:45:42.496Z a22379c9-427c-45ab-bfd5-c166bf507418 Task timed out after 600.11 seconds
END RequestId: a22379c9-427c-45ab-bfd5-c166bf507418
REPORT RequestId: a22379c9-427c-45ab-bfd5-c166bf507418 Duration: 600108.83 ms Billed Duration: 601489 ms Memory Size: 256 MB Max Memory Used: 154 MB Init Duration: 1379.93 ms
The function is deployed from a Docker image stored in ECR.
My Lambda function handler looks like this:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import math
from time import sleep
from datetime import date
from requests.adapters import HTTPAdapter
import boto3
from io import BytesIO
COLUMNS = [<<list of columns for the pandas df>>]
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "https://www.bbc.com/news/entertainment-arts-64759120",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Connection": "keep-alive",
    "Accept-Language": "en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6",
}
KEYWORDS = [<<list of keywords>>]
def make_get_request(max_retries, URL):
    session = requests.Session()
    retry = HTTPAdapter(max_retries=max_retries)
    session.mount('https://', retry)
    session.mount('http://', retry)
    try:
        response = session.get(URL, headers=HEADERS)
        return response
    except (requests.exceptions.ConnectionError, requests.exceptions.Timeout) as err:
        print(f"Failed to connect to the API, retrying... Error: {err}")
        # Retry the same URL after a connection failure
        return make_get_request(max_retries, URL)
    except requests.exceptions.TooManyRedirects as err:
        print("Bad URL, try a different one")
        raise SystemExit(err)
    except requests.exceptions.HTTPError as err:
        raise SystemExit(err)
def get_pages() -> int:
    BASE_URL = "https://www.stepstone.de/jobs/data-engineer?action=facet_selected%3bage%3bage_1&ag=age_1"
    response = make_get_request(3, BASE_URL)
    soup = BeautifulSoup(response.text, 'html.parser')
    result_tag = soup.find('span', 'res-kyg8or at-facet-header-total-results')
    results = result_tag.text
    results = results.replace(".", "")
    pages = math.ceil(int(results) / 25)
    return pages
def extract_job_cards(title, days, i):
    URL = f"https://www.stepstone.de/jobs/{title}?page_selected={i}&action=facet_selected%3bage%3bage_{days}&ag=age_{days}"
    response = make_get_request(3, URL)
    soup = BeautifulSoup(response.text, 'html.parser')
    cards = soup.find_all('article', 'res-iibro8')
    return cards
def request_job_description(href):
    DESCRIPTION_LINK = "https://www.stepstone.de" + href
    response_details = make_get_request(3, DESCRIPTION_LINK)
    return response_details
def extract_keywords(card):
    atag = card.h2.a
    title_card = atag.div
    title_card_str = title_card.text
    title = title_card_str.split("</div></div></div>")[0]
    print(title)
    company_name = card.div.span.text
    href = atag["href"]
    occured_keywords = []
    response_soup = BeautifulSoup(request_job_description(href).text, 'html.parser')
    boxes = response_soup.find_all('div', 'listing-content-provider-10ltcrf')
    if len(boxes) == 4:
        del boxes[0]
        del boxes[-1]
    for box in boxes:
        text = box.find('span', 'listing-content-provider-pz58b2').get_text()
        for keyword in KEYWORDS:
            if keyword.upper() in text.upper():
                occured_keywords.append(keyword.upper())
    occured_keywords = list(dict.fromkeys(occured_keywords))
    return occured_keywords, title, href, company_name
def append_to_df(occured_keywords, job_df, title, company_name, href):
    job_dict = {
        "LOADED_AT": date.today(),
        "JOB_TITLE": title,
        "COMPANY": company_name,
        "HREF_LINK": href,
        'PYTHON': 0
        # There are more key-value pairs, but I left them out of this post for simplicity
    }
    for skill in occured_keywords:
        job_dict[skill] = 1
    row = []
    for value in job_dict.values():
        row.append(value)
    job_df.loc[len(job_df)] = row
def extract_and_append_skills(cards, job_df):
    for card in cards:
        if all(job not in card.h2.a.div.text for job in ["Data Engineer", "DWH", "Data Warehouse", "ETL", "Analytics Engineer", "Business Intelligence", "Data Platform", "Data Architekt", "Data Architect"]):
            continue
        else:
            keywords, title, href, company_name = extract_keywords(card)
            append_to_df(keywords, job_df, title, company_name, href)
def main(event, context):
    job_df = pd.DataFrame(columns=COLUMNS)
    try:
        for i in range(get_pages()):
            cards = extract_job_cards('data-engineer', 1, i)
            extract_and_append_skills(cards, job_df)
        job_df = job_df[~job_df.duplicated(subset=['HREF_LINK'])].copy()
        print(len(job_df))
        return "Success"
    except Exception as e:
        print(e)
The Dockerfile looks like this:
FROM public.ecr.aws/lambda/python:3.9
COPY stepstone_scraper.py ${LAMBDA_TASK_ROOT}
COPY requirements.txt ./
RUN pip install -r requirements.txt -t "${LAMBDA_TASK_ROOT}"
RUN chmod 644 $(find . -type f)
RUN chmod 755 $(find . -type d)
CMD ["stepstone_scraper.main"]
The Lambda function is created in Terraform like this:
resource "aws_lambda_function" "job_scraping_function" {
package_type = "Image"
image_uri = "${aws_ecr_repository.scraping_repo.repository_url}:latest"
function_name = "job_scraping_function"
role = aws_iam_role.lambda_s3_role.arn
memory_size = 256
timeout = 600
depends_on = [null_resource.docker_build_and_push]
}
The underlying role can be assumed by Lambda, has full S3, EC2, Lambda, ECR, and CloudWatch access, and has the AWSLambdaBasicExecutionRole managed policy attached to it.
Has anyone got an idea what my issue might be?
2 Answers
Your

def main(event, context):

is incorrectly indented, so that is probably why Lambda can't use the main function.

You are hardly logging anything, so I would expect exactly the logs you are seeing. You need to add more logging to see what is happening before the timeout. This is not an issue of "getting proper logs"; it is an issue of your Lambda function timing out.
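As a rough sketch of what that could look like (assuming your existing constants and helpers such as get_pages, extract_job_cards, and extract_and_append_skills stay as they are; the log messages are just placeholders), log before and after every network-bound step so the last line in CloudWatch tells you where the function is stuck:

import logging

# The Lambda Python runtime already attaches a handler to the root logger,
# so setting the level is enough for these messages to reach CloudWatch.
logger = logging.getLogger()
logger.setLevel(logging.INFO)

def main(event, context):
    job_df = pd.DataFrame(columns=COLUMNS)
    logger.info("Handler started")
    pages = get_pages()
    logger.info("get_pages() returned %d pages", pages)
    for i in range(pages):
        logger.info("Requesting page %d of %d", i + 1, pages)
        cards = extract_job_cards('data-engineer', 1, i)
        logger.info("Found %d cards on page %d", len(cards), i + 1)
        extract_and_append_skills(cards, job_df)
    job_df = job_df[~job_df.duplicated(subset=['HREF_LINK'])].copy()
    logger.info("Finished with %d unique jobs", len(job_df))
    return "Success"

With something like this in place, the last log line before "Task timed out" points directly at the request that hangs.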
The reason it is timing out is most likely that the network requests you are making don't work from the Lambda environment. Most likely you have configured the Lambda function to run in a VPC without deploying it to a subnet that has a route to a NAT Gateway. If that's not the case, then the URL you are trying to access on the Internet may have blocked AWS IPs.
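One quick way to check the VPC theory is to inspect the deployed function's network configuration with boto3 (a sketch, run locally with credentials that can read the function; the function name is taken from your Terraform):

import boto3

# If SubnetIds is non-empty, the function runs inside a VPC and needs a route
# to a NAT Gateway (or VPC endpoints) in those subnets to reach the Internet.
client = boto3.client("lambda")
config = client.get_function_configuration(FunctionName="job_scraping_function")
vpc_config = config.get("VpcConfig") or {}
print("VpcConfig:", vpc_config)
if vpc_config.get("SubnetIds"):
    print("VPC-attached: check the subnets' route tables for a NAT Gateway.")
else:
    print("Not VPC-attached: outbound Internet should work, so look at the "
          "target site blocking AWS IPs instead.")

To test the blocked-IP theory, a single requests.get(BASE_URL, headers=HEADERS, timeout=10) at the top of the handler, with the status code logged, will fail fast instead of hanging for the full 600 seconds.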