
I’m trying to develop a simple Lambda function that scrapes a PDF and saves it to an S3 bucket, given the URL and the desired filename as input data. I keep receiving the error "Read-only file system", and I’m not sure if I have to change the bucket permissions or if there is something else I am missing. I am new to S3 and Lambda and would appreciate any help.

This is my code:

import urllib.request
import json
import boto3


def lambda_handler(event, context):   
    s3 = boto3.client('s3') 
    url = event['url']
    filename = event['filename'] + ".pdf"
    response = urllib.request.urlopen(url)   
    file = open(filename, 'w')
    file.write(response.read())
    s3.upload_fileobj(response.read(), 'sasbreports', filename)
    file.close()

This was my event file:

{
  "url": "https://purpose-cms-preprod01.s3.amazonaws.com/wp-content/uploads/2022/03/09205150/FY21-NIKE-Impact-Report_SASB-Summary.pdf",
  "filename": "nike"
}

When I tested the function, I received this error:

{
  "errorMessage": "[Errno 30] Read-only file system: 'nike.pdf.pdf'",
  "errorType": "OSError",
  "requestId": "de0b23d3-1e62-482c-bdf8-e27e82251941",
  "stackTrace": [
    "  File \"/var/task/lambda_function.py\", line 15, in lambda_handler\n    file = open(filename + \".pdf\", 'w')\n"
  ]
}

2 Answers


  1. AWS Lambda functions can only write to the /tmp/ directory. All other directories are read-only.

    Also, there is a default limit of 512 MB of storage in /tmp/, so make sure you delete the files after uploading them to S3, since the Lambda execution environment can be re-used for future invocations.
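    A minimal sketch of this approach, assuming the `sasbreports` bucket from the question (`tmp_path` is a hypothetical helper added here for illustration):

    ```python
    import os
    import urllib.request
    import boto3


    def tmp_path(name):
        # /tmp is the only writable directory in a Lambda environment
        return os.path.join("/tmp", name + ".pdf")


    def lambda_handler(event, context):
        s3 = boto3.client("s3")
        path = tmp_path(event["filename"])
        # download the PDF into the writable /tmp directory
        urllib.request.urlretrieve(event["url"], path)
        try:
            s3.upload_file(path, "sasbreports", os.path.basename(path))
        finally:
            # free /tmp space in case this execution environment is reused
            os.remove(path)
    ```

    The `try`/`finally` ensures the temporary file is deleted even if the upload fails, so repeated invocations don't saturate the 512 MB of /tmp storage.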

  2. AWS Lambda has limited space in /tmp, the sole writable location.
    Writing there can be risky without proper disk management, because the storage persists across multiple invocations of the same execution environment: it can fill up, or unexpectedly leak files from previous requests.
    Instead of saving the PDF locally, write it directly to S3 without involving the file system, like this:

    import urllib.request
    import json
    import boto3


    def lambda_handler(event, context):
        s3 = boto3.client('s3')
        url = event['url']
        filename = event['filename']
        # urlopen returns a file-like object, which is what upload_fileobj expects
        response = urllib.request.urlopen(url)
        s3.upload_fileobj(response, 'sasbreports', filename)
    

    BTW: the `.pdf` appending should be removed or kept according to your use case.
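    For small files like this one, an equivalent sketch reads the whole response into memory and uses `put_object`, which accepts raw bytes directly (again assuming the `sasbreports` bucket from the question):

    ```python
    import urllib.request
    import boto3


    def lambda_handler(event, context):
        s3 = boto3.client("s3")
        filename = event["filename"] + ".pdf"  # keep or drop the extension per your use case
        with urllib.request.urlopen(event["url"]) as response:
            # put_object takes the body as bytes, so no file is written anywhere
            s3.put_object(Bucket="sasbreports", Key=filename, Body=response.read())
        return {"bucket": "sasbreports", "key": filename}
    ```

    This trades memory for simplicity; for large PDFs the streaming `upload_fileobj` version above is the safer choice.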
