I'm trying to play around with these platforms and build some sort of data lake/warehouse.
I have a MinIO instance on localhost:9000, and I can use the UI and upload files. Great.
I created a network named spark_network and started MinIO:
docker volume create s3_drive
docker run --network spark_network --network-alias iceberg-lake.minio -p 9000:9000 -p 9001:9001 -v s3_drive:/data -e MINIO_ROOT_USER=username -e MINIO_ROOT_PASSWORD=password -d --name minio quay.io/minio/minio server /data --console-address ":9001"
I built my own Docker image from this Dockerfile:
FROM quay.io/jupyter/all-spark-notebook
# Set the working directory to /code
WORKDIR /code
# Copy PostgreSQL driver
COPY postgresql-42.7.1.jar /opt/jars/postgresql-42.7.1.jar
# Expose Spark UI and JupyterLab ports
EXPOSE 4040 8888
# Start JupyterLab on container start
CMD ["start-notebook.sh", "--NotebookApp.token=''"]
followed by (after building the image and tagging it spark-jupyterlab):
docker volume create jupyter_code
docker run -p 8888:8888 -p 4040:4040 -e AWS_REGION=us-east-1 -e MINIO_REGION=us-east-1 -e AWS_S3_ENDPOINT=http://minio:9000 -e AWS_ACCESS_KEY_ID=key_id -e AWS_SECRET_ACCESS_KEY=secret_key --network spark_network -v jupyter_code:/code -d --name spark spark-jupyterlab
So everything is up, and I also have a Postgres container on the same network.
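For reference, a quick sanity check from the notebook to confirm the endpoint and keys work outside of Spark (boto3 may need a pip install in the all-spark-notebook image) would look roughly like this:
import os
import boto3

# Point boto3 at the MinIO container using the same env vars passed to the spark container
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["AWS_S3_ENDPOINT"],            # http://minio:9000
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    region_name=os.environ.get("AWS_REGION", "us-east-1"),
)

# List a few keys under the warehouse prefix to confirm the credentials and bucket are reachable
resp = s3.list_objects_v2(Bucket="iceberg-lake", Prefix="lake/", MaxKeys=5)
for obj in resp.get("Contents", []):
    print(obj["Key"])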
When I try to read from the MinIO server with Spark (in JupyterLab), I create the session like this:
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .appName("iceberg_jdbc")
    .config("spark.jars", "/opt/jars/postgresql-42.7.1.jar")
    .config('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.4,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.2,org.apache.iceberg:iceberg-aws-bundle:1.4.2')
    .config('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')
    .config('spark.sql.catalog.cves', 'org.apache.iceberg.spark.SparkCatalog')
    .config('spark.sql.catalog.cves.io-impl', 'org.apache.iceberg.aws.s3.S3FileIO')
    .config('spark.sql.catalog.cves.warehouse', 's3a://iceberg-lake/lake/')
    .config('spark.sql.catalog.cves.s3.endpoint', 'http://minio:9000')
    .config('spark.sql.catalog.cves.s3.path-style-access', 'true')
    .config('spark.sql.catalog.cves.catalog-impl', 'org.apache.iceberg.jdbc.JdbcCatalog')
    .config('spark.sql.catalog.cves.uri', 'jdbc:postgresql://postgres-test:5432/metastore')
    .config('spark.sql.catalog.cves.jdbc.verifyServerCertificate', 'false')
    .config('spark.sql.catalog.cves.jdbc.useSSL', 'false')
    .config('spark.sql.catalog.cves.jdbc.user', '<pg user>')
    .config('spark.sql.catalog.cves.jdbc.password', '<pg password>')
    .config("spark.hadoop.fs.s3a.access.key", "<access-key>")
    .config("spark.hadoop.fs.s3a.secret.key", "<secret-key>")
    .getOrCreate()
)
After all that, I'm trying to read one JSON file from a bucket on MinIO called "iceberg-lake".
I tried to do it without changing or adding any policy or user (but of course with access keys), and also with a dedicated user, policy and whatnot.
The end result is always the same. When I try to read like this:
df = spark.read.json("s3a://iceberg-lake/lake/1999/0xxx/CVE-1999-0001.json")
I'm getting this:
Py4JJavaError: An error occurred while calling o60.json.
: java.nio.file.AccessDeniedException: s3a://iceberg-lake/lake/1999/0xxx/CVE-1999-0001.json: getFileStatus on s3a://iceberg-lake/lake/1999/0xxx/CVE-1999-0001.json: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: AGJWYZYKWD26CZXH; S3 Extended Request ID: 0haOOhMeMtsgJN5OMwmehLBWVFkDqZmy8WHvA+Tiym4IBNR0x88kIPEH3ddonlVPS6/FXcOyfrI=; Proxy: null), S3 Extended Request ID: 0haOOhMeMtsgJN5OMwmehLBWVFkDqZmy8WHvA+Tiym4IBNR0x88kIPEH3ddonlVPS6/FXcOyfrI=:403 Forbidden
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:255)
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:175)
at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3796)
at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3688)
at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$exists$34(S3AFileSystem.java:4703)
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:499)
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:444)
at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2337)
at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2356)
at org.apache.hadoop.fs.s3a.S3AFileSystem.exists(S3AFileSystem.java:4701)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4(DataSource.scala:756)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4$adapted(DataSource.scala:754)
at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:380)
at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
at scala.util.Success.$anonfun$map$1(Try.scala:255)
at scala.util.Success.map(Try.scala:213)
at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1395)
at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
I tried different policies and different keys, with and without users. I deleted all the images and built them from scratch again, and played around with new containers.
Still, I can't read a single JSON file (in the end I want to read the whole folder, of course).
Can anybody help?
2 Answers
The fact that the error comes back with an extended request ID implies the response is coming from AWS, not MinIO.
Why don't you set "spark.hadoop.fs.s3a.endpoint" to your MinIO endpoint, so the S3A connector actually gets the value?
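Something like this in your session builder, alongside the settings you already have (I'm guessing the endpoint and path-style values from your docker run, so treat this as a sketch and adjust as needed):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg_jdbc")
    # ... keep your jars, packages and Iceberg catalog settings from above ...
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")       # point S3A at MinIO, not AWS
    .config("spark.hadoop.fs.s3a.path.style.access", "true")           # MinIO generally needs path-style access
    .config("spark.hadoop.fs.s3a.access.key", "<access-key>")
    .config("spark.hadoop.fs.s3a.secret.key", "<secret-key>")
    .getOrCreate()
)
With that in place, the s3a:// read should stop going out to amazonaws.com and hit your MinIO container instead.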
I have a Docker Compose file for this kind of evaluation setup:
For direction on the whole setup, I have it documented in this blog.