
A Python Shell job cannot use more than one DPU. This means that it has a limit of 16 GB of memory.

Earlier today, I wired up what I considered to be a modest ETL task in AWS Glue with 1 DPU, written as a Python Shell job. It ran for just over 10 minutes. In theory, the task queries a Microsoft SQL Server hosted in EC2, produces 3,000,000 rows with about 250 columns, and writes them to CSV. The CSV should come to about 2.5 GB. In practice, I get

Command Failed due to Out of Memory

from Glue. As far as I can tell, this error does not come from SQL Server; it comes from 1 DPU not being enough. Batching the queries and CSV writes fixed the problem, but I would rather not have to do that.
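
For reference, the batched workaround looks roughly like the sketch below (the driver choice, table and bucket names are placeholders, not the real job):

    import boto3
    import pandas as pd
    import pymssql  # placeholder; any SQL Server driver available to the job works

    conn = pymssql.connect(server="<host>", user="<user>",
                           password="<password>", database="<db>")
    s3 = boto3.client("s3")

    # Stream the query in chunks so only one batch is held in memory at a time,
    # then upload each chunk to S3 as its own CSV part.
    for i, chunk in enumerate(pd.read_sql("SELECT * FROM <table>", conn, chunksize=100_000)):
        local_path = f"/tmp/part-{i:05d}.csv"
        chunk.to_csv(local_path, index=False)
        s3.upload_file(local_path, "<bucket>", f"output/part-{i:05d}.csv")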

This confuses me very deeply. I do not consider 2.5 GB of data to be an unreasonable amount to ETL. As a person with a SQL background, I eat 2.5 GB for breakfast and I was doing this very same task in SSIS years ago. The Python does not do any complex manipulation of the data. It just grabs it from SQL and writes it to new CSV files on S3.

This brings me to my question. AWS Glue is advertised as a cloud-scale ETL tool, but my experience above suggests it cannot manage even a modest ETL task. What am I missing, and how can these limitations be overcome?

2 Answers


  1. I’m not familiar with Glue, but have you seen duckdb?

    It has a Python client, but is C++ under the hood and is very memory efficient compared to vanilla Python.

    It has ODBC support, so you can connect directly to the SQL Server, run the query, and use its CSV export to write directly to a temp file, then use Python to write that file wherever you need it.
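
    Untested sketch of that flow (I don't remember DuckDB's exact ODBC setup, so the connection here is plain pyodbc and DuckDB is only used for the disk-backed staging table and the CSV export; table and bucket names are placeholders):

    import boto3
    import duckdb
    import pandas as pd
    import pyodbc

    sql_conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=<host>;DATABASE=<db>;UID=<user>;PWD=<password>")
    duck = duckdb.connect("/tmp/staging.duckdb")  # file-backed, so the table does not have to fit in RAM

    # Pull the query in chunks and append them to a DuckDB table on disk.
    first = True
    for chunk in pd.read_sql("SELECT * FROM <table>", sql_conn, chunksize=100_000):
        duck.register("chunk", chunk)
        if first:
            duck.execute("CREATE TABLE staging AS SELECT * FROM chunk")
            first = False
        else:
            duck.execute("INSERT INTO staging SELECT * FROM chunk")
        duck.unregister("chunk")

    # DuckDB's CSV writer streams to a temp file rather than building it in memory.
    duck.execute("COPY staging TO '/tmp/output.csv' (HEADER, DELIMITER ',')")
    boto3.client("s3").upload_file("/tmp/output.csv", "<bucket>", "output/output.csv")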

  2. The main issue is that you are running this as a Python shell job, and plain Python's memory management is not always well suited to handling large datasets efficiently. In this case, you should configure your job to run on the Spark engine.

    A Spark job runs in an Apache Spark environment managed by AWS Glue. You can still write your code in Python (using PySpark); the only difference is the underlying infrastructure. Spark is well optimized for memory-intensive tasks and also comes with optimized built-in connectors for many database systems (see the AWS Glue documentation on connection types for more details).

    There are different ways you can create and configure a Glue job.

    To create the job through the visual console, navigate to AWS Glue -> ETL Jobs -> Create job -> Author code with a script editor, then set the engine to Spark:
    (Screenshot: selecting Spark as the engine in the job settings.)

    If you are using CloudFormation or the AWS CDK, set the Name under Command to glueetl, which selects Spark as the engine (as opposed to pythonshell for Python shell jobs).

    CloudFormation example:
        MyJob:
          Type: AWS::Glue::Job
          Properties:
            Command:
              Name: glueetl
              ScriptLocation: "s3://<your-S3-script-uri>"
            DefaultArguments:
              "--job-bookmark-option": "job-bookmark-enable"
            ExecutionProperty:
              MaxConcurrentRuns: 2
            GlueVersion: "4.0"
            MaxRetries: 0
            Name: cf-job1
            Role: !Ref MyJobRole
    

    AWS CDK example:

    import aws_cdk.aws_glue as glue

    myjob = glue.CfnJob(
        self,
        "MyJob",
        command=glue.CfnJob.JobCommandProperty(
            name="glueetl",
            python_version="3",
            script_location="s3://<your-S3-script-uri>",
        ),
        role=MyJobRole,  # role ARN (or name) as a string; CfnJob does not take a Role construct
        default_arguments={
            "--job-bookmark-option": "job-bookmark-disable",
        },
        execution_property=glue.CfnJob.ExecutionPropertyProperty(
            max_concurrent_runs=1
        ),
        glue_version="4.0",
        max_retries=0,
        name="cf-job1",
        number_of_workers=2,
        worker_type="G.1X",
    )
    

    A simple job that reads from a SQL Server database could look like the following:

    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.dynamicframe import DynamicFrame

    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    sc = SparkContext()
    glue_context = GlueContext(sc)
    spark = glue_context.spark_session
    job = Job(glue_context)
    job.init(args['JOB_NAME'], args)

    # Connection options for the built-in "sqlserver" connection type
    connection_sqlserver_options = {
        "url": "",       # e.g. jdbc:sqlserver://<host>:1433;databaseName=<db>
        "dbtable": "",
        "user": "",
        "password": ""}

    df = glue_context.create_dynamic_frame.from_options(
        connection_type="sqlserver",
        connection_options=connection_sqlserver_options)
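
    Since the original goal is CSV on S3, the corresponding write could then be (the bucket path is a placeholder):

    # Write the DynamicFrame to S3 as CSV files and finish the job
    glue_context.write_dynamic_frame.from_options(
        frame=df,
        connection_type="s3",
        connection_options={"path": "s3://<your-bucket>/output/"},
        format="csv")
    job.commit()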
    

    You can also use a custom connector by making its .jar available to the job:

    # Make the connector's .jar available to the job, e.g. by passing its S3 path
    # in the "--extra-jars" job parameter (under DefaultArguments / default_arguments).

    # Read data from SQL Server using the Spark SQL Server Connector
    df = spark.read.format("com.microsoft.sqlserver.jdbc.spark") \
        .option("url", "jdbc:sqlserver://your_sqlserver_host:1433;databaseName=mydb") \
        .option("dbtable", "mytable") \
        .option("user", "your_username") \
        .option("password", "your_password") \
        .load()
    
    # Write the data frame to S3 as Parquet files
    s3_output_path = "s3://my-bucket/output/"
    df.write.mode("overwrite").parquet(s3_output_path)
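
    If you specifically need CSV (as in the question) rather than Parquet, the same DataFrame writer can produce that too:

    # CSV instead of Parquet, to match the original output format
    df.write.mode("overwrite").option("header", "true").csv(s3_output_path)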
    