I run my AWS Glue jobs locally in a Docker container (AWS Glue lib 4.0) and want to convert/write a pandas DataFrame to Delta format.
I added
import sys
from awsglue.utils import getResolvedOptions
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("YourAppName")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)
sys.argv += ['--datalake-formats', 'delta']
args = getResolvedOptions(sys.argv, ['datalake-formats'])
but this line
spark.createDataFrame(pandas_df).write.format('delta').save('myfile.delta')
still gives me the error Failed to find data source: delta.
I don't get what I am missing here.
2 Answers
Found the answer in an AWS blog post:
"Glue 4.0: Add native data lake libraries AWS Glue 4.0 Docker image supports native data lake libraries; Apache Hudi, Delta Lake, and Apache Iceberg. You can pass the environment variable DATALAKE_FORMATS to load the relevant JAR files.
-e DATALAKE_FORMATS=hudi,delta,iceberg
"
When you set this environment variable while starting your Docker container, the relevant JAR files are loaded.
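As a rough sketch of how a local script could guard against the missing variable before touching Spark: the helper names delta_enabled and write_pandas_as_delta are hypothetical, and the case-insensitive, comma-separated parsing of DATALAKE_FORMATS is my assumption rather than documented behavior. The Spark configuration is copied from the question.

```python
import os


def delta_enabled(env=None):
    """Return True if DATALAKE_FORMATS lists 'delta' (comma-separated)."""
    env = os.environ if env is None else env
    formats = env.get("DATALAKE_FORMATS", "")
    return "delta" in [f.strip().lower() for f in formats.split(",")]


def write_pandas_as_delta(pandas_df, path):
    """Write a pandas DataFrame to `path` in Delta format (hypothetical helper)."""
    if not delta_enabled():
        raise RuntimeError(
            "DATALAKE_FORMATS does not include 'delta'; restart the container "
            "with -e DATALAKE_FORMATS=delta"
        )
    # Imported lazily so the check above works even without PySpark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("YourAppName")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )
    spark.createDataFrame(pandas_df).write.format("delta").save(path)
```

Calling write_pandas_as_delta(pandas_df, 'myfile.delta') inside a container started without the variable should then fail early with a readable message instead of "Failed to find data source: delta".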
I'm not a Glue expert, but it looks like you're specifying --datalake-formats too late, when the job has already started. Per the documentation, that parameter should be specified in the aws glue start-job-run ... command, not in your code.
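For a job that runs in AWS rather than locally, the same idea could be sketched from Python with boto3's Glue client, which passes job parameters through the Arguments mapping of start_job_run. The helper names and the job name are hypothetical, and I'm assuming that several formats are joined with commas. Running it requires AWS credentials and an existing job.

```python
def build_job_arguments(formats):
    """Build the Arguments mapping for a job run, e.g. ['delta']."""
    return {"--datalake-formats": ",".join(formats)}


def start_delta_job(job_name):
    """Start a Glue job run with Delta Lake enabled (needs AWS credentials)."""
    import boto3  # imported lazily; only needed when actually calling AWS

    glue = boto3.client("glue")
    response = glue.start_job_run(
        JobName=job_name,  # hypothetical: use the name of your own job
        Arguments=build_job_arguments(["delta"]),
    )
    return response["JobRunId"]
```

The point is that the parameter is supplied when the run is started, so it is already in effect before any of the job's own code executes.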