
Description

In PySpark, a DataFrame can be saved to S3/S3A (not AWS itself, but an S3-compliant storage), yet its data end up unencrypted even though fs.s3a.encryption.algorithm (SSE-C) and fs.s3a.encryption.key are set.

Reproducibility

Generate the key as follows:

encKey=$(openssl rand -base64 32)  # 256-bit random key, base64-encoded, as SSE-C (AES-256) requires

Start PySpark shell:

pyspark --master spark://[some_host]:7077 \
    --packages io.delta:delta-core_2.12:2.0.0,org.apache.hadoop:hadoop-aws:3.3.1,com.amazonaws:aws-java-sdk-bundle:1.11.901,org.apache.hadoop:hadoop-common:3.3.1 \
    --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
    --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"

Within PySpark, a toy example:

sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "[access.key]")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "[secret.key]")
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "[s3.compliant.endpoint]")
sc._jsc.hadoopConfiguration().set("fs.s3a.encryption.algorithm", "SSE-C")
sc._jsc.hadoopConfiguration().set("fs.s3a.encryption.key", "[the_encKey_above]")
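
For reference, the same options can also be set when the SparkSession is created, via Spark's spark.hadoop.* prefix (a sketch using the same placeholder values as above):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # "spark.hadoop." properties are copied into the Hadoop configuration
    # of the driver and all executors at session start
    .config("spark.hadoop.fs.s3a.access.key", "[access.key]")
    .config("spark.hadoop.fs.s3a.secret.key", "[secret.key]")
    .config("spark.hadoop.fs.s3a.endpoint", "[s3.compliant.endpoint]")
    .config("spark.hadoop.fs.s3a.encryption.algorithm", "SSE-C")
    .config("spark.hadoop.fs.s3a.encryption.key", "[the_encKey_above]")
    .getOrCreate()
)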

from pyspark.sql.types import StructType, StructField, StringType, IntegerType
data2 = [("James","","Smith","36636","M",3000),
    ("Michael","Rose","","40288","M",4000),
    ("Robert","","Williams","42114","M",4000),
    ("Maria","Anne","Jones","39192","F",4000),
    ("Jen","Mary","Brown","","F",-1)
  ]

schema = StructType([ 
    StructField("firstname",StringType(),True), 
    StructField("middlename",StringType(),True), 
    StructField("lastname",StringType(),True), 
    StructField("id", StringType(), True), 
    StructField("gender", StringType(), True), 
    StructField("salary", IntegerType(), True) 
  ])
 
df = spark.createDataFrame(data=data2,schema=schema)

#df.write.format("csv").option("header", "true").save("s3a://data/test")

df.repartition(1).write.format("csv").option("header", "true").save("s3a://data/test")

We can see the folder s3a://data/test created, and there is a CSV file inside. Unfortunately, the file is not encrypted: it can even be downloaded manually through a web browser and then viewed with Notepad! The setting fs.s3a.encryption.algorithm seems to be ignored.
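
To double-check from outside Spark: a HEAD request without the customer key must fail with HTTP 400 for an SSE-C-encrypted object, so if it succeeds the object was stored unencrypted. A minimal sketch with boto3, assuming the same placeholder endpoint and credentials as above:

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client(
    "s3",
    endpoint_url="https://[s3.compliant.endpoint]",
    aws_access_key_id="[access.key]",
    aws_secret_access_key="[secret.key]",
)
# pick the CSV part file that Spark wrote under s3a://data/test
key = s3.list_objects_v2(Bucket="data", Prefix="test/")["Contents"][0]["Key"]
try:
    s3.head_object(Bucket="data", Key=key)
    print("HEAD succeeded without the key -> object is NOT SSE-C encrypted")
except ClientError:
    print("HEAD failed without the key -> object is SSE-C encrypted")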

Environment

  • Apache Spark v3.2.2
  • Hadoop-aws v3.3.1 / Hadoop-common v3.3.1
  • openjdk 11.0.16.1 2022-08-12 (Temurin)
  • Python 3.10.4
  • Ubuntu 22.04 LTS

Debug

Interestingly, the same endpoint has no problem encrypting the uploaded file when the AWS CLI is used:

aws --endpoint-url https://[s3.compliant.endpoint] \
    s3api put-object \
    --body "/home/[a_user]/Desktop/a_file.csv" \
    --bucket "data" \
    --key "test/a_file.csv" \
    --sse-customer-algorithm AES256 \
    --sse-customer-key $encKey \
    --sse-customer-key-md5 $md5Key
aws --version
# aws-cli/2.7.35 Python/3.9.11 Linux/5.15.0-48-generic exe/x86_64.ubuntu.22 prompt/off
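
Note that $md5Key is not defined above; presumably it is the base64-encoded MD5 digest of the raw 32-byte key, which S3 uses to validate the SSE-C key header. A sketch of how it could be derived from $encKey, in Python:

import base64
import hashlib

enc_key_b64 = "[the_encKey_above]"       # output of `openssl rand -base64 32`
raw_key = base64.b64decode(enc_key_b64)  # the raw 32-byte AES-256 key
md5_key = base64.b64encode(hashlib.md5(raw_key).digest()).decode()
print(md5_key)                           # value to pass as --sse-customer-key-md5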

I read the Hadoop guide Working with Encrypted S3 Data, but it did not help.

2 Answers


  1. Chosen as BEST ANSWER

    Found a solution:

    The libraries hadoop-aws v3.3.1 / aws-java-sdk-bundle v1.11.901 / hadoop-common v3.3.1 must have some bugs, or are simply incompatible with the S3 provider's current protocol (as of 2022-10-06). The issue can be overcome by using the latest versions of the libraries: hadoop-aws v3.3.4 (and hence its aws-java-sdk-bundle v1.12.262 dependency), and hadoop-common v3.3.4 (to match hadoop-aws).

    The release notes for 3.3.2 through 3.3.4 may well contain the AWS/S3-related fixes that resolve this issue.

    In short, for PySpark, use:

    pyspark --master spark://[some_host]:7077 \
        --packages io.delta:delta-core_2.12:2.0.0,org.apache.hadoop:hadoop-aws:3.3.4,com.amazonaws:aws-java-sdk-bundle:1.12.262,org.apache.hadoop:hadoop-common:3.3.4 \
        --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
        ...
    

  2. When using Hadoop 3.3.1 or earlier, use fs.s3a.server-side-encryption-algorithm and fs.s3a.server-side-encryption.key; the fs.s3a.encryption.* options only arrived in Hadoop 3.3.2, together with support for client-side encryption (HADOOP-13887). In code, that would look like the toy example from the question, but with the pre-3.3.2 option names, as sketched below.
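
    A sketch reusing the question's placeholders, with the older option names this answer refers to:

    # Hadoop <= 3.3.1 uses the older S3A option names for SSE-C
    sc._jsc.hadoopConfiguration().set("fs.s3a.server-side-encryption-algorithm", "SSE-C")
    sc._jsc.hadoopConfiguration().set("fs.s3a.server-side-encryption.key", "[the_encKey_above]")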
