Amazon web services - AWS Glue locally: convert pandas df to delta

Finja
September 19, 2023
287 views
0 votes
2 Answers

I run my AWS Glue jobs locally in a Docker container (AWS Glue libs 4.0) and want to write a pandas DataFrame in Delta format.

I added

spark = (
    SparkSession.builder
    .appName("YourAppName")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)
sys.argv += ['--datalake-formats', 'delta']
args = getResolvedOptions(sys.argv, ['datalake-formats'])

but this line

spark.createDataFrame(pandas_df).write.format('delta').save('myfile.delta')

still gives me the error Failed to find data source: delta.

I don't get what I am missing here.

Answers

Chosen as BEST ANSWER
- Finja
- September 10, 2023 at 12:53 pm
- 0 votes
Found the answer in an AWS blog post:

"Glue 4.0: Add native data lake libraries AWS Glue 4.0 Docker image supports native data lake libraries; Apache Hudi, Delta Lake, and Apache Iceberg. You can pass the environment variable DATALAKE_FORMATS to load the relevant JAR files.

-e DATALAKE_FORMATS=hudi,delta,iceberg"

When you set this environment variable while starting your Docker container, it does the following:
```
Adding delta-2.1.0 libs to Spark Classpath
```
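For illustration, here is a minimal sketch of how the container start command with that environment variable could be assembled from Python. The helper name `glue_docker_command` is made up for this example, and the image tag is an assumption; use whichever Glue 4.0 image you actually pull, and add your own volume mounts and entry command as needed.

```python
import shlex

# Assumed image tag for the local Glue 4.0 image; replace with the one you use.
GLUE_IMAGE = "amazon/aws-glue-libs:glue_libs_4.0.0_image_01"


def glue_docker_command(formats, image=GLUE_IMAGE):
    """Build a `docker run` argument list that sets DATALAKE_FORMATS.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    if not formats:
        raise ValueError("at least one data lake format is required")
    return [
        "docker", "run", "-it", "--rm",
        # Comma-separated list, as in the quoted blog post.
        "-e", "DATALAKE_FORMATS=" + ",".join(formats),
        image,
    ]


# Print the command as a shell-ready string for copy and paste.
print(shlex.join(glue_docker_command(["delta"])))
```

The resulting list can also be passed to `subprocess.run` directly. Once the container starts with the Delta libs on the classpath, the `write.format('delta')` call from the question should be able to resolve the data source.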


- AlexOtt
- September 9, 2023 at 2:09 pm
- 0 votes
I'm not a Glue expert, but it looks like you're specifying --datalake-formats too late, when the job has already started. Per the documentation, that parameter needs to be specified in the aws glue start-job-run ... call, not in your code.
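As a rough sketch of what passing the parameter at job start could look like for a job running in AWS (not the local container), the snippet below uses boto3's `start_job_run`. The helper names are hypothetical, the job name is whatever your job is called, and it assumes boto3 is installed and AWS credentials are configured.

```python
def glue_job_arguments(formats):
    """Return the Arguments mapping that requests data lake formats for a run."""
    return {"--datalake-formats": ",".join(formats)}


def start_job(job_name, formats):
    """Start a Glue job run with the data lake formats set as a job argument.

    Hypothetical wrapper; job_name must refer to an existing Glue job.
    """
    # Imported lazily so glue_job_arguments works without boto3 installed.
    import boto3

    glue = boto3.client("glue")
    response = glue.start_job_run(
        JobName=job_name,
        Arguments=glue_job_arguments(formats),
    )
    return response["JobRunId"]
```

The point is that the argument is part of the job run request, so it is set before the job starts rather than appended to `sys.argv` inside the script.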
