I am trying to add a new column called timestamp in AWS Glue. My upstream data keys contain capital letters. However, after adding the timestamp column, the keys of the remaining columns get lowercased.
Experiment 1: Transform – SQL Query
select
  cast((Absolute_Time - 2082844800) * 1000 as bigint) as timestamp,
  *
from
  engine_demo;
However, it lowercases all my keys.
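As a side note, the constant 2082844800 in the cast is the number of seconds between 1904-01-01 and the Unix epoch (1970-01-01); my assumption is that Absolute_Time is stored against the 1904-based "absolute time" epoch used by some data-acquisition formats. The arithmetic can be sanity-checked in plain Python:

```python
from datetime import datetime, timezone

# Assumption: Absolute_Time counts seconds since 1904-01-01 UTC.
# 2082844800 s is the gap between that epoch and the Unix epoch.
OFFSET = 2082844800
gap = datetime(1970, 1, 1, tzinfo=timezone.utc) - datetime(1904, 1, 1, tzinfo=timezone.utc)
assert int(gap.total_seconds()) == OFFSET

def to_unix_millis(absolute_time_seconds: float) -> int:
    """Mirror of the SQL expression: (Absolute_Time - 2082844800) * 1000 as bigint."""
    return int((absolute_time_seconds - OFFSET) * 1000)

# An Absolute_Time equal to the offset lands exactly on the Unix epoch.
assert to_unix_millis(2082844800) == 0
```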
Experiment 2: Transform – Custom code
from awsglue.dynamicframe import DynamicFrame, DynamicFrameCollection
from pyspark.sql import functions as F

def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    # Take the first (and only) frame from the incoming collection
    df = dfc.select(list(dfc.keys())[0]).toDF()
    # Convert Absolute_Time to Unix epoch milliseconds
    df = df.withColumn(
        "timestamp",
        ((F.col("Absolute_Time") - 2082844800) * 1000).cast("long")
    )
    dynamic_frame = DynamicFrame.fromDF(df, glueContext, "changed")
    return DynamicFrameCollection({"result_dynf": dynamic_frame}, glueContext)
It also lowercases all my keys.
Experiment 3
I found a suggestion to enable case sensitivity for Spark SQL globally. However, there is no place to set it in the "Transform – SQL Query" node, so I tried to set the config in a "Transform – Custom code" node:
from awsglue.dynamicframe import DynamicFrame, DynamicFrameCollection
from pyspark.sql import functions as F

def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    # Enable case-sensitive name resolution for Spark SQL
    glueContext.spark_session.conf.set("spark.sql.caseSensitive", "true")
    df = dfc.select(list(dfc.keys())[0]).toDF()
    df = df.withColumn(
        "timestamp",
        ((F.col("Absolute_Time") - 2082844800) * 1000).cast("long")
    )
    dynamic_frame = DynamicFrame.fromDF(df, glueContext, "changed")
    return DynamicFrameCollection({"dynamic_frame": dynamic_frame}, glueContext)
However, the code above gives the same result.
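(One possible reason the runtime conf.set has no effect is that the session already exists by the time the custom-code node runs. A commonly suggested alternative, which I have not verified in Glue Studio, is to pass the flag as a job parameter at launch, under Job details → Job parameters, so the session is created with it:)

```shell
# Assumption: the job accepts Spark configuration via the --conf job parameter.
# Key / value as entered in the Glue job parameters, shown here as a sketch:
--conf spark.sql.caseSensitive=true
```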
Is there a way to preserve the key letter case in the AWS Glue Transform node? Thanks!
2 Answers
Here is my experiment result: if I remove the "convert time" Transform step and sink to Delta Lake directly, the letter case is preserved. However, I am using a Glue Crawler to register the Delta table in the Glue Catalog. Here is some info I got from Jay Naves from the AWS team (thanks!): even if I preserve the case in the Delta table, the downstream will still become lowercase.
I am ending up with this solution. Since SQL queries are designed to be case-insensitive by default:

1. Selecting the column with its exact case (select Absolute_Time ...) will return Absolute_Time.
2. Selecting it in lowercase (select absolute_time ...) will return absolute_time.
3. select * will return absolute_time.

Most of the time, dashboard users can query the first way to get the correct letter case. If any user needs to use select *, I can provide a dict to help map from lowercase back to the correct letter case. We can also apply the case-sensitive flag in the ApplyMapping section of the ETL job script. Please see this answer. Hope it helps!
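The lowercase-to-original mapping mentioned above could be a plain dict built from the known upstream schema; a minimal sketch (every column name other than Absolute_Time is a made-up example):

```python
# Hypothetical upstream schema; only Absolute_Time comes from the actual data.
ORIGINAL_COLUMNS = ["Absolute_Time", "Engine_RPM", "Oil_Temp"]

# Map the lowercased names the catalog exposes back to the original spelling.
CASE_MAP = {name.lower(): name for name in ORIGINAL_COLUMNS}

def restore_case(row: dict) -> dict:
    """Rename the keys of a result row back to their original letter case."""
    return {CASE_MAP.get(key, key): value for key, value in row.items()}

# Keys not in the map are passed through unchanged.
assert restore_case({"absolute_time": 1.5, "engine_rpm": 900}) == {
    "Absolute_Time": 1.5,
    "Engine_RPM": 900,
}
```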