I run my AWS Glue jobs locally in a Docker container (AWS Glue lib 4.0) and want to write a pandas DataFrame in Delta format.

I added

import sys
from awsglue.utils import getResolvedOptions
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("YourAppName")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)
sys.argv += ['--datalake-formats', 'delta']
args = getResolvedOptions(sys.argv, ['datalake-formats'])

but this line

spark.createDataFrame(pandas_df).write.format('delta').save('myfile.delta')

still gives me the error Failed to find data source: delta.

I don't get what I'm missing here.

2 Answers


  1. Chosen as BEST ANSWER

    Found the answer in an AWS blog post:

    "Glue 4.0: Add native data lake libraries

    AWS Glue 4.0 Docker image supports native data lake libraries; Apache Hudi, Delta Lake, and Apache Iceberg. You can pass the environment variable DATALAKE_FORMATS to load the relevant JAR files.

    -e DATALAKE_FORMATS=hudi,delta,iceberg"

    If you set this environment variable when starting your Docker container, it does the following:

    Adding delta-2.1.0 libs to Spark Classpath
    

  2. I’m not a Glue expert, but it looks like you’re specifying --datalake-formats too late, when the job has already started. Per the documentation, that parameter should be specified in the aws glue start-job-run ... command, not in your code.
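
Both answers come down to the same thing: Delta has to be requested before the job process starts, not from inside the script. A rough sketch of how that could look in Python follows. The helper names (enabled_datalake_formats, require_delta, glue_job_arguments, start_job_with_delta) are made up for illustration and are not part of the Glue libraries. The check assumes the DATALAKE_FORMATS value passed with -e is visible in the job's environment, and the last function assumes boto3 and AWS credentials are available.

```python
import os


def enabled_datalake_formats(env=None):
    """Return the formats listed in DATALAKE_FORMATS, e.g. {'delta'}."""
    env = os.environ if env is None else env
    raw = env.get("DATALAKE_FORMATS", "")
    # The variable is a comma-separated list such as "hudi,delta,iceberg".
    return {name.strip() for name in raw.split(",") if name.strip()}


def require_delta(env=None):
    """Fail early, before any Spark work, if Delta was not requested."""
    if "delta" not in enabled_datalake_formats(env):
        raise RuntimeError(
            "DATALAKE_FORMATS does not include 'delta'; "
            "restart the container with -e DATALAKE_FORMATS=delta"
        )


def glue_job_arguments(formats=("delta",)):
    """Build the Arguments mapping for a hosted Glue job run."""
    return {"--datalake-formats": ",".join(formats)}


def start_job_with_delta(job_name):
    """Start a hosted Glue job with Delta enabled (needs boto3 and credentials)."""
    import boto3  # imported lazily so the local helpers work without boto3

    glue = boto3.client("glue")
    return glue.start_job_run(JobName=job_name, Arguments=glue_job_arguments())
```

For the local Docker case, calling require_delta() at the top of the script, before the SparkSession is built, turns the later "Failed to find data source: delta" into an immediate and more explicit error. start_job_with_delta only applies to jobs running in the Glue service, where the parameter is passed at job start.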
