Description
Within PySpark, even though a DataFrame can be saved to S3/S3A (not AWS, but an S3-compatible storage), its data is written unencrypted even when fs.s3a.encryption.algorithm (SSE-C) and fs.s3a.encryption.key are set.
Reproducibility
Generate the key as follows:
encKey=$(openssl rand -base64 32)
Start PySpark shell:
pyspark --master spark://[some_host]:7077 \
  --packages io.delta:delta-core_2.12:2.0.0,org.apache.hadoop:hadoop-aws:3.3.1,com.amazonaws:aws-java-sdk-bundle:1.11.901,org.apache.hadoop:hadoop-common:3.3.1 \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
Within PySpark, a toy example:
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "[access.key]")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "[secret.key]")
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "[s3.compliant.endpoint]")
sc._jsc.hadoopConfiguration().set("fs.s3a.encryption.algorithm", "SSE-C")
sc._jsc.hadoopConfiguration().set("fs.s3a.encryption.key", "[the_encKey_above]")
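(As an extra sanity check, not part of the original report, the options can be read back from the live Hadoop configuration; this only confirms that they are set, not that the S3A connector honours them.)
# Sanity check: read the S3A encryption options back from the Hadoop configuration
hconf = sc._jsc.hadoopConfiguration()
print(hconf.get("fs.s3a.encryption.algorithm"))  # expected: SSE-C
print(hconf.get("fs.s3a.encryption.key"))        # expected: the base64 key set above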
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
data2 = [("James", "", "Smith", "36636", "M", 3000),
         ("Michael", "Rose", "", "40288", "M", 4000),
         ("Robert", "", "Williams", "42114", "M", 4000),
         ("Maria", "Anne", "Jones", "39192", "F", 4000),
         ("Jen", "Mary", "Brown", "", "F", -1)]
schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True)
])
df = spark.createDataFrame(data=data2, schema=schema)
#df.write.format("csv").option("header", "true").save("s3a://data/test")
df.repartition(1).write.format("csv").option("header", "true").save("s3a://data/test")
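(A quick read-back through the same S3A settings, added here as a sketch rather than part of the original test, verifies the round trip; with SSE-C actually applied, clients without the key would not be able to read the objects.)
# Sketch: read the freshly written CSV back via s3a to verify the round trip
df_check = spark.read.format("csv").option("header", "true").load("s3a://data/test")
df_check.show()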
We can see the folder s3a://data/test created, and there is a CSV file in it. Unfortunately, the file is not encrypted: it can even be downloaded manually through a web browser and then viewed in a text editor! The setting fs.s3a.encryption.algorithm seems to be ignored.
Environment
- Apache Spark v3.2.2
- Hadoop-aws v3.3.1 / Hadoop-common v3.3.1
- openjdk 11.0.16.1 2022-08-12 (Temurin)
- Python 3.10.4
- Ubuntu 22.04 LTS
Debug
Interestingly, the same endpoint has no problem encrypting an uploaded file when the AWS CLI is used:
aws --endpoint-url https://[s3.compliant.endpoint] \
  s3api put-object \
  --body "/home/[a_user]/Desktop/a_file.csv" \
  --bucket "data" \
  --key "test/a_file.csv" \
  --sse-customer-algorithm AES256 \
  --sse-customer-key $encKey \
  --sse-customer-key-md5 $md5Key
aws --version
# aws-cli/2.7.35 Python/3.9.11 Linux/5.15.0-48-generic exe/x86_64.ubuntu.22 prompt/off
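($md5Key is not defined above. For SSE-C, S3 expects the base64-encoded MD5 digest of the raw key bytes, so it was presumably derived from $encKey along these lines; the following Python sketch and its variable names are illustrative, not taken from the original post.)
# Illustrative sketch: derive the base64-encoded MD5 digest expected by --sse-customer-key-md5
import base64, hashlib
enc_key_b64 = "[the_encKey_above]"        # the base64 key from `openssl rand -base64 32`
raw_key = base64.b64decode(enc_key_b64)   # 32 raw key bytes
md5_key = base64.b64encode(hashlib.md5(raw_key).digest()).decode()
print(md5_key)                            # value to pass as $md5Key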
I read the Hadoop guide Working with Encrypted S3 Data, but it did not help.
2 Answers
Found a solution:
The libraries hadoop-aws v3.3.1 / aws-java-sdk-bundle v1.11.901 / hadoop-common v3.3.1 must have some bugs, or are simply incompatible with S3's latest protocol (as of 2022-10-06). The issue can be overcome by using the latest versions of the libraries, namely hadoop-aws v3.3.4 (and thus its aws-java-sdk-bundle v1.12.262 dependency) and hadoop-common v3.3.4 (to match hadoop-aws). Among the release notes from 3.3.2 to 3.3.4 there may be some AWS/S3-related fixes that resolve the issue.
In summary, for PySpark, use hadoop-aws v3.3.4, aws-java-sdk-bundle v1.12.262, and hadoop-common v3.3.4 in the --packages list of the launch command above.
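(A sketch of an equivalent PySpark setup with the bumped versions, expressed here through spark.jars.packages from Python rather than the --packages flag used earlier; the host and the Delta version are kept as in the original command and remain placeholders/assumptions.)
# Sketch: same session as above, with hadoop-aws/hadoop-common 3.3.4 and aws-java-sdk-bundle 1.12.262
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .master("spark://[some_host]:7077")
    .config("spark.jars.packages",
            "io.delta:delta-core_2.12:2.0.0,"
            "org.apache.hadoop:hadoop-aws:3.3.4,"
            "com.amazonaws:aws-java-sdk-bundle:1.12.262,"
            "org.apache.hadoop:hadoop-common:3.3.4")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)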
When using Hadoop 3.3.1 or earlier, use fs.s3a.server-side-encryption-algorithm and fs.s3a.server-side-encryption.key; the fs.s3a.encryption.* options only came in Hadoop 3.3.2, along with support for client-side encryption in HADOOP-13887.
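(For completeness, a sketch of the toy example's encryption settings rewritten with the pre-3.3.2 property names; the placeholder key is the same as above.)
# Sketch: SSE-C configuration using the property names understood by Hadoop 3.3.1 and earlier
sc._jsc.hadoopConfiguration().set("fs.s3a.server-side-encryption-algorithm", "SSE-C")
sc._jsc.hadoopConfiguration().set("fs.s3a.server-side-encryption.key", "[the_encKey_above]")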