A Python Shell job cannot use more than one DPU. This means that it has a limit of 16 GB of memory.
Earlier today I wired up what I considered a modest ETL task in AWS Glue with 1 DPU, written as a Python Shell job. It ran for just over 10 minutes. In theory, the task queries a Microsoft SQL Server instance hosted on EC2, pulls back 3,000,000 rows with about 250 columns, and writes them to CSV; the CSV should come to about 2.5 GB. In practice, I get
Command Failed due to Out of Memory
from Glue. As far as I can tell, this error does not come from SQL Server; it comes from 1 DPU not being enough. Batching the queries and the CSV writes fixed the problem, but I would rather not have to do that.
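For reference, the batching version looked roughly like the sketch below; pyodbc, the chunk size, and every name and path are illustrative rather than the exact job I ran.

```python
import csv
import boto3
import pyodbc

# Illustrative connection string and table; not the real job.
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=my-ec2-host;Database=MyDb;Uid=etl_user;Pwd=..."
)
cursor = conn.cursor()
cursor.execute("SELECT * FROM dbo.BigTable")

s3 = boto3.client("s3")
chunk_size, part = 100_000, 0
while True:
    rows = cursor.fetchmany(chunk_size)      # keep only one batch in memory
    if not rows:
        break
    path = f"/tmp/part_{part:04d}.csv"
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cursor.description])
        writer.writerows(rows)
    s3.upload_file(path, "my-bucket", f"exports/part_{part:04d}.csv")
    part += 1
```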
This confuses me very deeply. I do not consider 2.5 GB of data to be an unreasonable amount to ETL. As a person with a SQL background, I eat 2.5 GB for breakfast, and I was doing this very same task in SSIS years ago. The Python code does not do any complex manipulation of the data; it just grabs it from SQL Server and writes it to new CSV files on S3.
This brings me to my question. AWS Glue is advertised as a cloud-scale ETL tool, but my experience above suggests it cannot manage even modest ETL tasks. What am I missing, and how can these limitations be overcome?
2 Answers
I’m not familiar with Glue, but have you seen DuckDB?
It has a Python client, but it is C++ under the hood and is very memory-efficient compared to vanilla Python.
It has ODBC support, so you can connect directly to the SQL Server, run the query, and use its CSV export to write the result directly to a temp file, then use Python to put that file wherever you need it.
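Something along these lines, as a sketch only: the extension name and the odbc_scan() table function are assumptions you would need to verify against the DuckDB community extension docs, and the connection string and paths are placeholders.

```python
import boto3
import duckdb

DSN = (
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=my-ec2-host;Database=MyDb;Uid=etl_user;Pwd=..."
)

con = duckdb.connect()  # in-memory DuckDB instance

# Assumption: an ODBC scanner community extension exposing odbc_scan();
# check the DuckDB docs for the exact extension name and signature.
con.execute("INSTALL odbc_scanner FROM community")
con.execute("LOAD odbc_scanner")

# COPY ... TO streams the result straight to a file on disk instead of
# materialising all of the rows as Python objects.
con.execute(f"""
    COPY (
        SELECT * FROM odbc_scan('{DSN}', 'dbo', 'BigTable')
    ) TO '/tmp/extract.csv' (HEADER, DELIMITER ',')
""")

boto3.client("s3").upload_file("/tmp/extract.csv", "my-bucket", "exports/extract.csv")
```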
The main issue is that you are using a plain Python Shell job, and Python’s memory management is not well suited to handling large datasets efficiently. In this case, you should configure your job to run on the Spark engine.
A Spark job runs in an Apache Spark environment managed by AWS Glue. You can still write your code in Python (using PySpark); the only difference is the underlying infrastructure. It is well optimized for memory-intensive tasks and also comes with optimized built-in connectors for many database systems (see the AWS Glue documentation for more details).
There are different ways you can create and configure a Glue job.
You can use the visual console to create the job: navigate to AWS Glue -> ETL Jobs -> Create job -> Author code with a script editor, then set the engine to Spark.
If you are using CloudFormation or the AWS CDK, you need to set the `name` under `command` to `glueetl`, which selects Spark as the engine, as opposed to `pythonshell` for Python Shell jobs. AWS CDK example:
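A minimal sketch with the CDK for Python (v2); the role ARN, bucket, and script path are placeholders:

```python
from aws_cdk import Stack, aws_glue as glue
from constructs import Construct

class EtlStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        glue.CfnJob(
            self, "SqlServerToS3Job",
            name="sqlserver-to-s3",
            # Placeholder ARN: an IAM role with Glue and S3 permissions.
            role="arn:aws:iam::123456789012:role/my-glue-job-role",
            command=glue.CfnJob.CommandProperty(
                name="glueetl",  # "glueetl" = Spark engine; "pythonshell" = Python Shell
                python_version="3",
                script_location="s3://my-bucket/scripts/etl_job.py",
            ),
            glue_version="4.0",
            worker_type="G.1X",
            number_of_workers=2,
        )
```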
A simple job to read from a SQL Server database can be as follows:
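For example (a sketch: the JDBC URL, table, credentials, and bucket are placeholders, and in a real job you would pull the credentials from a Glue connection or Secrets Manager rather than hard-coding them):

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read from SQL Server through Glue's built-in JDBC connector.
source = glueContext.create_dynamic_frame.from_options(
    connection_type="sqlserver",
    connection_options={
        "url": "jdbc:sqlserver://my-ec2-host:1433;databaseName=MyDb",
        "dbtable": "dbo.BigTable",
        "user": "etl_user",
        "password": "...",
    },
)

# Write to S3 as CSV; Spark spreads the work across executors instead of
# holding the whole dataset in a single Python process.
glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/exports/"},
    format="csv",
)

job.commit()
```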
You can also use any custom connector by specifying the path to its `.jar`:
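For example, upload the driver's .jar to S3 and pass it to the job through the --extra-jars argument, then read through Spark's generic JDBC reader (a sketch; the jar path, driver class, and connection details are placeholders):

```python
# Continues the job above (reuses the existing `spark` session).
# Pass the driver jar to the job via its --extra-jars argument, e.g.
#   --extra-jars s3://my-bucket/jars/mssql-jdbc.jar   (placeholder path)
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://my-ec2-host:1433;databaseName=MyDb")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("dbtable", "dbo.BigTable")
    .option("user", "etl_user")
    .option("password", "...")
    .load()
)
df.write.mode("overwrite").csv("s3://my-bucket/exports/")
```

If you define the job in CloudFormation or the CDK, the same S3 path goes into the job's default arguments under the --extra-jars key.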