
I just finished the following function that gets customer data from my Shopify store into an S3 bucket. It currently works like this: a trigger runs this Lambda on a daily basis, and all customers are written to an S3 bucket. Every already existing entry is simply overwritten, and new customers are added.

My question is: is this a scalable approach, or should I read all the files and compare timestamps so that only the new entries are added? Or is this second approach maybe worse?

import requests
import json
import boto3

s3 = boto3.client('s3')
bucket = 'testbucket'

url2 = "something.json"

def getCustomers():
    r = requests.get(url2)
    return r.json()

def lambda_handler(event, context):
    
    data = getCustomers()
    
    for customer in data["customers"]:
        
        #create a unique id for each customer
        customer_id = str(customer["id"])
        #create a file name to put the customer in bucket
        file_name = 'customers' + '/' + customer_id + '.json'
        
        #Saving .json to s3
        customer_string = str(customer)
        uploadByteStream = bytes(customer_string.encode('UTF-8')) 
        
        s3.put_object(Bucket=bucket, Key=file_name, Body=uploadByteStream)
        
        
    return {
            'statusCode': 200,
            'body': json.dumps('Success')
        }    

An example response is the following:

{
  "id": 71806090000,
  "email": "[email protected]",
  "accepts_marketing": false,
  "created_at": "2021-07-27T11:06:38+02:00",
  "updated_at": "2021-07-27T11:11:58+02:00",
  "first_name": "Bertje",
  "last_name": "Bertens",
  "orders_count": 0,
  "state": "disabled",
  "total_spent": "0.00",
  "last_order_id": null,
  "note": "",
  "verified_email": true,
  "multipass_identifier": null,
  "tax_exempt": false,
  "phone": "+32470000000",
  "tags": "",
  "last_order_name": null,
  "currency": "EUR",
  "addresses": [
    {
      "id": 6623179276486,
      "customer_id": 5371846099142,
      "first_name": "Bertje",
      "last_name": "Bertens",
      "company": "",
      "address1": "Somewhere",
      "address2": "",
      "city": "Somecity",
      "province": null,
      "country": "",
      "zip": "0000",
      "phone": null,
      "name": "Bertje Bertens",
      "province_code": null,
      "country_code": null,
      "country_name": "",
      "default": true
    }
  ],
  "accepts_marketing_updated_at": "2021-07-27T11:11:35+02:00",
  "marketing_opt_in_level": null,
  "tax_exemptions": [],
  "admin_graphql_api_id": "",
  "default_address": {
    "id": 6623179276486,
    "customer_id": 5371846099142,
    "first_name": "Bertje",
    "last_name": "Bertens",
    "company": "",
    "address1": "Somewhere",
    "address2": "",
    "city": "Somecity",
    "province": null,
    "country": "",
    "zip": "0000",
    "phone": null,
    "name": "Bertje Bertens",
    "province_code": null,
    "country_code": null,
    "country_name": "",
    "default": true
  }
}

2 Answers


  1. It will work as long as you manage to finish the whole process within the max 15 minute timeout of Lambda.
    S3 is built to scale to much more demanding workloads 😉

    But:

    It’s very inefficient, as you already observed. A better implementation would be to keep track of the timestamp of the last full load somewhere, e.g. in DynamoDB or the Systems Manager Parameter Store, and only write the customers whose "created_at" or "updated_at" attributes are after the last successful full load. At the end you update the full-load timestamp.

    Here is some pseudo code:

    last_full_load_date = get_last_full_load() or '1900-01-01T00:00:00Z'
    
    customers = get_customers()
    
    for customer in customers:
        if customer.created_at >= last_full_load_date or customer.updated_at >= last_full_load_date:
            write_customer(customer)
    
    set_last_full_load(datetime.now())
    

    This way you only write data that has actually changed (assuming the API is reliable).

    This also has the benefit that you’ll be able to retry if something goes wrong during writing, since you only update the last_full_load time at the end. Alternatively you could keep track of the last modified time per customer, but that doesn’t seem necessary if you do a bulk load anyway.
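    For reference, here is a minimal sketch of what those helpers could look like with the Systems Manager Parameter Store. The parameter name /shopify/last_full_load is just a placeholder, getCustomers() is reused from the question, and write_customer() stands in for the existing put_object logic:

    import boto3
    from datetime import datetime, timezone
    
    ssm = boto3.client('ssm')
    # hypothetical parameter name, pick whatever fits your naming scheme
    PARAM_NAME = '/shopify/last_full_load'
    
    def get_last_full_load():
        # returns the stored ISO timestamp string, or None on the very first run
        try:
            return ssm.get_parameter(Name=PARAM_NAME)['Parameter']['Value']
        except ssm.exceptions.ParameterNotFound:
            return None
    
    def set_last_full_load(timestamp):
        # overwrite the parameter with the timestamp of this run
        ssm.put_parameter(Name=PARAM_NAME, Value=timestamp,
                          Type='String', Overwrite=True)
    
    def lambda_handler(event, context):
        last_load = datetime.fromisoformat(
            get_last_full_load() or '1900-01-01T00:00:00+00:00')
        # capture the cut-off *before* fetching, so customers updated while
        # this run is in progress are picked up again on the next run
        run_started_at = datetime.now(timezone.utc).isoformat()
    
        data = getCustomers()
        for customer in data["customers"]:
            if (datetime.fromisoformat(customer["created_at"]) >= last_load
                    or datetime.fromisoformat(customer["updated_at"]) >= last_load):
                write_customer(customer)  # the existing put_object call
    
        set_last_full_load(run_started_at)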

  2. Is this a scalable approach or should I read all the files and compare timestamps to only add the new entries? Or is this second approach maybe worse?

    Generally speaking, you’re not going to run into many scalability problems with a daily task utilizing Lambda and S3.

    Some considerations:

    1. Costs
      a. Lambda execution costs. The longer your lambda runs, the more you pay.
      b. S3 transfer costs. Unless you run your lambda in a VPC and set up a VPC endpoint for your bucket, you pay S3 transfer costs from lambda -> internet (-> s3).

    2. Lambda execution timeouts.
      If you have many files to upload, you may eventually run into a problem where there are so many files to transfer that the work can’t be completed within a single invocation.

    3. Fault tolerance
      Right now, if your lambda fails for some reason, you’ll drop all the work for the day.

    How do these two approaches bear on these considerations?

    For (1) you simply have to calculate your costs. Technically, the approach of checking the timestamp first will help you here. However, my guess is that, if you’re only running this on a daily basis within a single invocation, the costs are minimal right now and not of much concern. We’re talking pennies per month at most (~$0.05/mo @ full 15 minute invocation once daily + transfer costs).
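    As a rough back-of-the-envelope check of that figure (assuming the default 128 MB memory setting and on-demand prices of about $0.0000166667 per GB-second and $0.20 per million requests; verify current pricing for your region):

    # hypothetical worst case: one full 15-minute invocation per day at 128 MB
    gb_seconds = 0.125 * 900 * 30               # 3375 GB-seconds per month
    compute = gb_seconds * 0.0000166667         # ~ $0.056 per month
    requests = 30 / 1_000_000 * 0.20            # 30 invocations, effectively $0
    print(f"~${compute + requests:.3f}/month before transfer costs")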

    For (2) the approach of checking timestamps is also somewhat better, but doesn’t truly address the scalability issue. If you expect you may eventually reach a point where you will run out of execution time in Lambda, you may want to consider a new architecture for the solution.

    For (3) neither approach has any real bearing. Either way, you have the same fault tolerance problem.

    Possible alternative architecture components to address these areas may include:

    • use of SQS to queue file transfers (helps with decoupling and gives you a DLQ for fault tolerance; see the sketch after this list)
    • use of scheduled (Fargate) ECS tasks instead of Lambda for compute (deals with Lambda timeout limitations) OR have Lambda consume the queue in batches
    • S3 VPC endpoints and in-VPC compute (optimize S3 transfer; likely not cost effective until much larger scale)
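    A minimal sketch of the SQS variant, assuming a queue named customer-export-queue already exists and is wired to the consumer Lambda through an SQS event source mapping (getCustomers() is reused from the question). The producer only enqueues customer payloads; the consumer writes whatever batch it receives to S3, and repeatedly failing batches end up in the DLQ:

    import json
    import boto3
    
    sqs = boto3.client('sqs')
    s3 = boto3.client('s3')
    bucket = 'testbucket'
    # hypothetical queue name
    queue_url = sqs.get_queue_url(QueueName='customer-export-queue')['QueueUrl']
    
    def producer_handler(event, context):
        # runs on the daily schedule: fetch once, fan the customers out to the queue
        data = getCustomers()
        for customer in data["customers"]:
            sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(customer))
    
    def consumer_handler(event, context):
        # invoked by the SQS event source with a batch of messages
        for record in event["Records"]:
            customer = json.loads(record["body"])
            key = 'customers/' + str(customer["id"]) + '.json'
            s3.put_object(Bucket=bucket, Key=key,
                          Body=json.dumps(customer).encode('UTF-8'))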

    So, to answer the question directly in summary:

    The current solution has some scalability concerns, namely the Lambda execution timeout, as well as fault-tolerance concerns. The second approach does introduce optimizations, but they do not address those scalability concerns. Additionally, the value you get from the second solution may not be significant.

    In any case, what you propose makes sense and shouldn’t take much effort to implement.

    ...
    # note: this also needs an `import datetime` at the top of the file
    customer_updated_at = datetime.datetime.fromisoformat(customer['updated_at'])
    
    file_name = 'customers' + '/' + customer_id + '.json'
    
    # Send a HEAD request for the object's last-modified date to see if we need to update it
    try:
        response = s3.head_object(Bucket=bucket, Key=file_name)
        s3_modified = response["LastModified"]
    except s3.exceptions.ClientError:
        # the object doesn't exist yet (new customer), so force an upload
        s3_modified = None
    
    if s3_modified is None or customer_updated_at > s3_modified:
        # Saving .json to s3
        customer_string = str(customer)
        uploadByteStream = bytes(customer_string.encode('UTF-8'))
        s3.put_object(Bucket=bucket, Key=file_name, Body=uploadByteStream)
    else:
        print('s3 version is up to date, no need to upload')
    