
A Python Shell job cannot use more than one DPU. This means that it has a limit of 16 GB of memory.

Earlier today, I wired up what I considered to be a modest ETL task in AWS Glue with 1 DPU, written as a Python Shell job. It ran for just over 10 minutes. In theory, the task queries a Microsoft SQL Server hosted in EC2, produces 3,000,000 rows with about 250 columns, and writes them to CSV. The CSV should come to about 2.5 GB. In practice, I get

Command Failed due to Out of Memory

from Glue. As far as I can tell, this error does not come from SQL Server; it comes from 1 DPU not being enough. Batching the queries and CSV writes fixed the problem, but I would rather not have to do that.
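
For reference, the batched workaround looks roughly like the sketch below (the driver choice, table and bucket names are placeholders, not the real job):

    import boto3
    import pandas as pd
    import pymssql  # placeholder; any SQL Server driver available to the job works

    conn = pymssql.connect(server="<host>", user="<user>",
                           password="<password>", database="<db>")
    s3 = boto3.client("s3")

    # Stream the query in chunks so only one batch is held in memory at a time,
    # then upload each chunk to S3 as its own CSV part.
    for i, chunk in enumerate(pd.read_sql("SELECT * FROM <table>", conn, chunksize=100_000)):
        local_path = f"/tmp/part-{i:05d}.csv"
        chunk.to_csv(local_path, index=False)
        s3.upload_file(local_path, "<bucket>", f"output/part-{i:05d}.csv")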

This confuses me very deeply. I do not consider 2.5 GB of data to be an unreasonable amount to ETL. As a person with a SQL background, I eat 2.5 GB for breakfast and I was doing this very same task in SSIS years ago. The Python does not do any complex manipulation of the data. It just grabs it from SQL and writes it to new CSV files on S3.

This brings me to my question. AWS Glue is advertised as a cloud-scale ETL tool, but my experience above suggests it cannot manage even a modest ETL task. What am I missing, and how can these limitations be overcome?

2 Answers


  1. I’m not familiar with Glue, but have you seen duckdb?

    It has a Python client, but is C++ under the hood and is very memory efficient compared to vanilla Python.

    It has ODBC support, so you can connect directly to the SQL Server, run the query, and use its CSV export to write directly to a temp file, then use Python to write that file wherever you need it.
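
    Untested sketch of that flow (I don't remember DuckDB's exact ODBC setup, so the connection here is plain pyodbc and DuckDB is only used for the disk-backed staging table and the CSV export; table and bucket names are placeholders):

    import boto3
    import duckdb
    import pandas as pd
    import pyodbc

    sql_conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=<host>;DATABASE=<db>;UID=<user>;PWD=<password>")
    duck = duckdb.connect("/tmp/staging.duckdb")  # file-backed, so the table does not have to fit in RAM

    # Pull the query in chunks and append them to a DuckDB table on disk.
    first = True
    for chunk in pd.read_sql("SELECT * FROM <table>", sql_conn, chunksize=100_000):
        duck.register("chunk", chunk)
        if first:
            duck.execute("CREATE TABLE staging AS SELECT * FROM chunk")
            first = False
        else:
            duck.execute("INSERT INTO staging SELECT * FROM chunk")
        duck.unregister("chunk")

    # DuckDB's CSV writer streams to a temp file rather than building it in memory.
    duck.execute("COPY staging TO '/tmp/output.csv' (HEADER, DELIMITER ',')")
    boto3.client("s3").upload_file("/tmp/output.csv", "<bucket>", "output/output.csv")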

  2. The main issue is that you are running this as a Python shell job, and plain Python's memory management is not always well suited to handling large datasets efficiently. In this case, you should configure your job to run on the Spark engine.

    A Spark job runs in an Apache Spark environment managed by AWS Glue. You can still write your code in Python (using PySpark); the only difference is the underlying infrastructure. Spark is well optimized for memory-intensive tasks and also comes with optimized built-in connectors for many database systems (see the AWS Glue documentation on connection types for more details).

    There are different ways you can create and configure a Glue job.

    To create the job through the visual console, navigate to AWS Glue -> ETL Jobs -> Create job -> Author code with a script editor, then set the engine to Spark:
    (Screenshot: selecting Spark as the engine in the job settings.)

    If you are using CloudFormation or the AWS CDK, set the Name under Command to glueetl, which selects Spark as the engine (as opposed to pythonshell for Python shell jobs).

    CloudFormation example:
        MyJob:
          Type: AWS::Glue::Job
          Properties:
            Command:
              Name: glueetl
              ScriptLocation: "s3://<your-S3-script-uri>"
            DefaultArguments:
              "--job-bookmark-option": "job-bookmark-enable"
            ExecutionProperty:
              MaxConcurrentRuns: 2
            GlueVersion: "4.0"
            MaxRetries: 0
            Name: cf-job1
            Role: !Ref MyJobRole
    

    AWS CDK example:

    import aws_cdk.aws_glue as glue

    myjob = glue.CfnJob(
        self,
        "MyJob",
        command=glue.CfnJob.JobCommandProperty(
            name="glueetl",
            python_version="3",
            script_location="s3://<your-S3-script-uri>",
        ),
        role=MyJobRole,  # role ARN (or name) as a string; CfnJob does not take a Role construct
        default_arguments={
            "--job-bookmark-option": "job-bookmark-disable",
        },
        execution_property=glue.CfnJob.ExecutionPropertyProperty(
            max_concurrent_runs=1
        ),
        glue_version="4.0",
        max_retries=0,
        name="cf-job1",
        number_of_workers=2,
        worker_type="G.1X",
    )
    

    A simple job that reads from a SQL Server database could look like the following:

    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.dynamicframe import DynamicFrame

    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    sc = SparkContext()
    glue_context = GlueContext(sc)
    spark = glue_context.spark_session
    job = Job(glue_context)
    job.init(args['JOB_NAME'], args)

    # Connection options for the built-in "sqlserver" connection type
    connection_sqlserver_options = {
        "url": "",       # e.g. jdbc:sqlserver://<host>:1433;databaseName=<db>
        "dbtable": "",
        "user": "",
        "password": ""}

    df = glue_context.create_dynamic_frame.from_options(
        connection_type="sqlserver",
        connection_options=connection_sqlserver_options)
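
    Since the original goal is CSV on S3, the corresponding write could then be (the bucket path is a placeholder):

    # Write the DynamicFrame to S3 as CSV files and finish the job
    glue_context.write_dynamic_frame.from_options(
        frame=df,
        connection_type="s3",
        connection_options={"path": "s3://<your-bucket>/output/"},
        format="csv")
    job.commit()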
    

    You can also use a custom connector by making its .jar available to the job:

    # Make the connector's .jar available to the job, e.g. by passing its S3 path
    # in the "--extra-jars" job parameter (under DefaultArguments / default_arguments).

    # Read data from SQL Server using the Spark SQL Server Connector
    df = spark.read.format("com.microsoft.sqlserver.jdbc.spark") \
        .option("url", "jdbc:sqlserver://your_sqlserver_host:1433;databaseName=mydb") \
        .option("dbtable", "mytable") \
        .option("user", "your_username") \
        .option("password", "your_password") \
        .load()
    
    # Write the data frame to S3 as Parquet files
    s3_output_path = "s3://my-bucket/output/"
    df.write.mode("overwrite").parquet(s3_output_path)
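
    If you specifically need CSV (as in the question) rather than Parquet, the same DataFrame writer can produce that too:

    # CSV instead of Parquet, to match the original output format
    df.write.mode("overwrite").option("header", "true").csv(s3_output_path)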
    