
Dataframes saved to S3/S3A from Spark are unencrypted despite settings "fs.s3a.encryption.algorithm" and "fs.s3a.encryption.key" – Ubuntu

Description: Within PySpark, a DataFrame can be saved to S3/S3A (not AWS itself, but an S3-compatible storage), yet its data end up unencrypted even though fs.s3a.encryption.algorithm (SSE-C) and fs.s3a.encryption.key are set. Reproducibility: Generate the key as follows: encKey=$(openssl rand…
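For context, these S3A options are Hadoop configuration, so in PySpark they are usually forwarded through the spark.hadoop. prefix on the session builder. A minimal sketch, assuming a hypothetical endpoint (fs.s3a.encryption.* are the option names in recent hadoop-aws releases; older releases use the fs.s3a.server-side-encryption-* names):

```python
# Sketch only: the endpoint is a placeholder; requires pyspark plus the
# hadoop-aws connector on the classpath.
import base64
import os

from pyspark.sql import SparkSession

# SSE-C expects the base64-encoded 256-bit key.
enc_key = base64.b64encode(os.urandom(32)).decode()

spark = (
    SparkSession.builder.appName("sse-c-demo")
    # spark.hadoop.* settings are handed down to the S3A filesystem connector
    .config("spark.hadoop.fs.s3a.endpoint", "https://s3.example.internal")
    .config("spark.hadoop.fs.s3a.encryption.algorithm", "SSE-C")
    .config("spark.hadoop.fs.s3a.encryption.key", enc_key)
    .getOrCreate()
)
```

If the options are set only after the session (and its Hadoop configuration) has been created, the S3A connector may never see them, which matches the symptom described above.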

VIEW QUESTION

Visual Studio Code – Error when creating SparkSession in PySpark

When I am trying to create a SparkSession I get this error:

    spark = SparkSession.builder.appName("Practice").getOrCreate()
    py4j.protocol.Py4JError: org.apache.spark.api.python.PythonUtils.getPythonAuthSocketTimeout does not exist in the JVM

This is my code:

    import pyspark
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("Practice").getOrCreate()

What am I doing…
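This Py4JError is commonly reported when the pip-installed pyspark package and the Spark distribution that SPARK_HOME points at are different versions, so the JVM side lacks a method the newer Python side calls. A small helper sketch (the version strings are illustrative) for spotting such a mismatch:

```python
# Sketch: compare the pip-installed pyspark version against the Spark
# build that SPARK_HOME points at. A mismatch is a common cause of
# "PythonUtils.getPythonAuthSocketTimeout does not exist in the JVM".

def diagnose(pyspark_version, spark_home_version=None):
    """Return a hint for two (illustrative) version strings."""
    if spark_home_version is None:
        return "SPARK_HOME unset: pyspark uses its own bundled Spark JARs."
    if pyspark_version.split(".")[:2] != spark_home_version.split(".")[:2]:
        return (
            f"Mismatch: pyspark {pyspark_version} vs Spark {spark_home_version}; "
            "align the two versions or unset SPARK_HOME."
        )
    return "Versions match."

print(diagnose("3.1.2", "2.4.8"))
```

In practice the two inputs would come from pyspark.__version__ and the output of $SPARK_HOME/bin/spark-submit --version.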

VIEW QUESTION

Amazon web services – Why does AWS EMR PySpark get stuck when I try to aggregate dataframe

I'm running a Spark application in AWS EMR. The code is like this:

    with SparkSession.builder.appName(f"Spark App").getOrCreate() as spark:
        dataframe = spark.read.format('jdbc').options( ... ).load()
        print("Log A")
        max_date_result = dataframe.agg(max_(date_format('date', 'yyyy-MM-dd')).alias('max_date')).collect()[0]
        print("Log B")

This application always gets stuck for a long time…
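spark.read with the jdbc source is lazy, so the query only runs when .collect() forces it; the pause between "Log A" and "Log B" is typically the whole table being shipped over JDBC so Spark can aggregate it. One common mitigation, sketched here with a hypothetical URL and table name and reusing the question's spark session, is to push the MAX down to the database through the JDBC query option:

```python
# Sketch only: connection details and table name are placeholders.
# The database computes MAX(date); Spark receives a single row instead
# of the full table.
max_date_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db.example.internal:5432/mydb")
    .option("query", "SELECT MAX(date) AS max_date FROM my_table")
    .load()
)
max_date_result = max_date_df.collect()[0]
```

If the full table really is needed in Spark, partitioning options (partitionColumn, lowerBound, upperBound, numPartitions) let the read parallelize instead of running as a single JDBC query.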

VIEW QUESTION