
Description

In PySpark, a DataFrame can be saved to S3/S3A (not AWS itself, but an S3-compliant storage), yet its data end up unencrypted even though fs.s3a.encryption.algorithm (SSE-C) and fs.s3a.encryption.key are set.

Reproducibility

Generate the key as follows:

encKey=$(openssl rand -base64 32)  # 256-bit random key, base64-encoded, as SSE-C (AES-256) requires

Start PySpark shell:

pyspark --master spark://[some_host]:7077 \
    --packages io.delta:delta-core_2.12:2.0.0,org.apache.hadoop:hadoop-aws:3.3.1,com.amazonaws:aws-java-sdk-bundle:1.11.901,org.apache.hadoop:hadoop-common:3.3.1 \
    --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
    --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"

Within PySpark, a toy example:

sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "[access.key]")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "[secret.key]")
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "[s3.compliant.endpoint]")
sc._jsc.hadoopConfiguration().set("fs.s3a.encryption.algorithm", "SSE-C")
sc._jsc.hadoopConfiguration().set("fs.s3a.encryption.key", "[the_encKey_above]")
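
For reference, the same options can also be set when the SparkSession is created, via Spark's spark.hadoop.* prefix (a sketch using the same placeholder values as above):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # "spark.hadoop." properties are copied into the Hadoop configuration
    # of the driver and all executors at session start
    .config("spark.hadoop.fs.s3a.access.key", "[access.key]")
    .config("spark.hadoop.fs.s3a.secret.key", "[secret.key]")
    .config("spark.hadoop.fs.s3a.endpoint", "[s3.compliant.endpoint]")
    .config("spark.hadoop.fs.s3a.encryption.algorithm", "SSE-C")
    .config("spark.hadoop.fs.s3a.encryption.key", "[the_encKey_above]")
    .getOrCreate()
)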

from pyspark.sql.types import StructType, StructField, StringType, IntegerType
data2 = [("James","","Smith","36636","M",3000),
    ("Michael","Rose","","40288","M",4000),
    ("Robert","","Williams","42114","M",4000),
    ("Maria","Anne","Jones","39192","F",4000),
    ("Jen","Mary","Brown","","F",-1)
  ]

schema = StructType([ 
    StructField("firstname",StringType(),True), 
    StructField("middlename",StringType(),True), 
    StructField("lastname",StringType(),True), 
    StructField("id", StringType(), True), 
    StructField("gender", StringType(), True), 
    StructField("salary", IntegerType(), True) 
  ])
 
df = spark.createDataFrame(data=data2,schema=schema)

#df.write.format("csv").option("header", "true").save("s3a://data/test")

df.repartition(1).write.format("csv").option("header", "true").save("s3a://data/test")

We can see the folder s3a://data/test created, and there is a CSV file inside. Unfortunately, the file is not encrypted: it can even be downloaded manually through a web browser and then viewed with Notepad! The setting fs.s3a.encryption.algorithm seems to be ignored.
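
To double-check from outside Spark: a HEAD request without the customer key must fail with HTTP 400 for an SSE-C-encrypted object, so if it succeeds the object was stored unencrypted. A minimal sketch with boto3, assuming the same placeholder endpoint and credentials as above:

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client(
    "s3",
    endpoint_url="https://[s3.compliant.endpoint]",
    aws_access_key_id="[access.key]",
    aws_secret_access_key="[secret.key]",
)
# pick the CSV part file that Spark wrote under s3a://data/test
key = s3.list_objects_v2(Bucket="data", Prefix="test/")["Contents"][0]["Key"]
try:
    s3.head_object(Bucket="data", Key=key)
    print("HEAD succeeded without the key -> object is NOT SSE-C encrypted")
except ClientError:
    print("HEAD failed without the key -> object is SSE-C encrypted")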

Environment

  • Apache Spark v3.2.2
  • Hadoop-aws v3.3.1 / Hadoop-common v3.3.1
  • openjdk 11.0.16.1 2022-08-12 (Temurin)
  • Python 3.10.4
  • Ubuntu 22.04 LTS

Debug

Interestingly, the same endpoint has no problem encrypting the uploaded file when the AWS CLI is used:

aws --endpoint-url https://[s3.compliant.endpoint] \
    s3api put-object \
    --body "/home/[a_user]/Desktop/a_file.csv" \
    --bucket "data" \
    --key "test/a_file.csv" \
    --sse-customer-algorithm AES256 \
    --sse-customer-key $encKey \
    --sse-customer-key-md5 $md5Key
aws --version
# aws-cli/2.7.35 Python/3.9.11 Linux/5.15.0-48-generic exe/x86_64.ubuntu.22 prompt/off
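
Note that $md5Key is not defined above; presumably it is the base64-encoded MD5 digest of the raw 32-byte key, which S3 uses to validate the SSE-C key header. A sketch of how it could be derived from $encKey, in Python:

import base64
import hashlib

enc_key_b64 = "[the_encKey_above]"       # output of `openssl rand -base64 32`
raw_key = base64.b64decode(enc_key_b64)  # the raw 32-byte AES-256 key
md5_key = base64.b64encode(hashlib.md5(raw_key).digest()).decode()
print(md5_key)                           # value to pass as --sse-customer-key-md5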

I read the Hadoop guide Working with Encrypted S3 Data, but it did not help.

2 Answers


  1. Chosen as BEST ANSWER

    Found a solution:

    The libraries hadoop-aws v3.3.1 / aws-java-sdk-bundle v1.11.901 / hadoop-common v3.3.1 must have some bugs, or are simply incompatible with the S3 provider's current protocol (as of 2022-10-06). The issue can be overcome by using the latest versions of the libraries: hadoop-aws v3.3.4 (and hence its aws-java-sdk-bundle v1.12.262 dependency), and hadoop-common v3.3.4 (to match hadoop-aws).

    The release notes for 3.3.2 through 3.3.4 may well contain the AWS/S3-related fixes that resolve this issue.

    In short, for PySpark, use:

    pyspark --master spark://[some_host]:7077 \
        --packages io.delta:delta-core_2.12:2.0.0,org.apache.hadoop:hadoop-aws:3.3.4,com.amazonaws:aws-java-sdk-bundle:1.12.262,org.apache.hadoop:hadoop-common:3.3.4 \
        --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
        ...
    

  2. When using Hadoop 3.3.1 or earlier, use fs.s3a.server-side-encryption-algorithm and fs.s3a.server-side-encryption.key; the fs.s3a.encryption.* options only arrived in Hadoop 3.3.2, together with support for client-side encryption (HADOOP-13887). In code, that would look like the toy example from the question, but with the pre-3.3.2 option names, as sketched below.
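
    A sketch reusing the question's placeholders, with the older option names this answer refers to:

    # Hadoop <= 3.3.1 uses the older S3A option names for SSE-C
    sc._jsc.hadoopConfiguration().set("fs.s3a.server-side-encryption-algorithm", "SSE-C")
    sc._jsc.hadoopConfiguration().set("fs.s3a.server-side-encryption.key", "[the_encKey_above]")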
