I'm trying to play around with these platforms and build some sort of data lake/warehouse.
I have a MinIO instance on localhost:9000, and I can use the UI and upload files. Great.
I created a network named spark_network and started MinIO:
docker volume create s3_drive
docker run --network spark_network --network-alias iceberg-lake.minio -p 9000:9000 -p 9001:9001 -v s3_drive:/data -e MINIO_ROOT_USER=username -e MINIO_ROOT_PASSWORD=password -d --name minio quay.io/minio/minio server /data --console-address ":9001"
I built my own Docker image from this Dockerfile:
FROM quay.io/jupyter/all-spark-notebook
# Set the working directory to /code
WORKDIR /code
# Copy PostgreSQL driver
COPY postgresql-42.7.1.jar /opt/jars/postgresql-42.7.1.jar
# Expose Spark UI and JupyterLab ports
EXPOSE 4040 8888
# Start JupyterLab on container start
CMD ["start-notebook.sh", "--NotebookApp.token=''"]
followed by (after building the image and tagging it spark-jupyterlab):
docker volume create jupyter_code
docker run -p 8888:8888 -p 4040:4040 -e AWS_REGION=us-east-1 -e MINIO_REGION=us-east-1 -e AWS_S3_ENDPOINT=http://minio:9000 -e AWS_ACCESS_KEY_ID=key_id -e AWS_SECRET_ACCESS_KEY=secret_key --network spark_network -v jupyter_code:/code -d --name spark spark-jupyterlab
So everything is up, and I also have a Postgres container on the same network.
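For reference, a quick sanity check from the notebook to confirm the endpoint and keys work outside of Spark (boto3 may need a pip install in the all-spark-notebook image) would look roughly like this:
import os
import boto3

# Point boto3 at the MinIO container using the same env vars passed to the spark container
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["AWS_S3_ENDPOINT"],            # http://minio:9000
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    region_name=os.environ.get("AWS_REGION", "us-east-1"),
)

# List a few keys under the warehouse prefix to confirm the credentials and bucket are reachable
resp = s3.list_objects_v2(Bucket="iceberg-lake", Prefix="lake/", MaxKeys=5)
for obj in resp.get("Contents", []):
    print(obj["Key"])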
When I try to read from the MinIO server with Spark (in JupyterLab), I create the session like this:
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .appName("iceberg_jdbc")
    .config("spark.jars", "/opt/jars/postgresql-42.7.1.jar")
    .config('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.4,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.2,org.apache.iceberg:iceberg-aws-bundle:1.4.2')
    .config('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')
    .config('spark.sql.catalog.cves', 'org.apache.iceberg.spark.SparkCatalog')
    .config('spark.sql.catalog.cves.io-impl', 'org.apache.iceberg.aws.s3.S3FileIO')
    .config('spark.sql.catalog.cves.warehouse', 's3a://iceberg-lake/lake/')
    .config('spark.sql.catalog.cves.s3.endpoint', 'http://minio:9000')
    .config('spark.sql.catalog.cves.s3.path-style-access', 'true')
    .config('spark.sql.catalog.cves.catalog-impl', 'org.apache.iceberg.jdbc.JdbcCatalog')
    .config('spark.sql.catalog.cves.uri', 'jdbc:postgresql://postgres-test:5432/metastore')
    .config('spark.sql.catalog.cves.jdbc.verifyServerCertificate', 'false')
    .config('spark.sql.catalog.cves.jdbc.useSSL', 'false')
    .config('spark.sql.catalog.cves.jdbc.user', '<pg user>')
    .config('spark.sql.catalog.cves.jdbc.password', '<pg password>')
    .config("spark.hadoop.fs.s3a.access.key", "<access-key>")
    .config("spark.hadoop.fs.s3a.secret.key", "<secret-key>")
    .getOrCreate()
)
After all that, I'm trying to read one JSON file from a bucket on MinIO called "iceberg-lake".
I tried to do it without changing or adding any policy or user (but of course with access keys), and also with a dedicated user, policy and whatnot.
The end result is always the same. When I try to read like this:
df = spark.read.json("s3a://iceberg-lake/lake/1999/0xxx/CVE-1999-0001.json")
I'm getting this:
Py4JJavaError: An error occurred while calling o60.json.
: java.nio.file.AccessDeniedException: s3a://iceberg-lake/lake/1999/0xxx/CVE-1999-0001.json: getFileStatus on s3a://iceberg-lake/lake/1999/0xxx/CVE-1999-0001.json: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: AGJWYZYKWD26CZXH; S3 Extended Request ID: 0haOOhMeMtsgJN5OMwmehLBWVFkDqZmy8WHvA+Tiym4IBNR0x88kIPEH3ddonlVPS6/FXcOyfrI=; Proxy: null), S3 Extended Request ID: 0haOOhMeMtsgJN5OMwmehLBWVFkDqZmy8WHvA+Tiym4IBNR0x88kIPEH3ddonlVPS6/FXcOyfrI=:403 Forbidden
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:255)
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:175)
at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3796)
at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3688)
at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$exists$34(S3AFileSystem.java:4703)
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:499)
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:444)
at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2337)
at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2356)
at org.apache.hadoop.fs.s3a.S3AFileSystem.exists(S3AFileSystem.java:4701)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4(DataSource.scala:756)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4$adapted(DataSource.scala:754)
at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:380)
at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
at scala.util.Success.$anonfun$map$1(Try.scala:255)
at scala.util.Success.map(Try.scala:213)
at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1395)
at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
I tried different policies and different keys, with and without users. I deleted all the images and built them from scratch again, and played around with new containers.
Still, I can't read a single JSON file (in the end I want to read the whole folder, of course).
Can anybody help?
2 Answers
The fact that the error comes back with an extended request ID implies the response is coming from AWS, not MinIO.
Why don't you set "spark.hadoop.fs.s3a.endpoint" to your MinIO endpoint, so the S3A connector actually gets the value?
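Something like this in your session builder, alongside the settings you already have (I'm guessing the endpoint and path-style values from your docker run, so treat this as a sketch and adjust as needed):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg_jdbc")
    # ... keep your jars, packages and Iceberg catalog settings from above ...
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")       # point S3A at MinIO, not AWS
    .config("spark.hadoop.fs.s3a.path.style.access", "true")           # MinIO generally needs path-style access
    .config("spark.hadoop.fs.s3a.access.key", "<access-key>")
    .config("spark.hadoop.fs.s3a.secret.key", "<secret-key>")
    .getOrCreate()
)
With that in place, the s3a:// read should stop going out to amazonaws.com and hit your MinIO container instead.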
I have a Docker Compose file for this kind of evaluation setup:
For direction on the whole setup, I have it documented in this blog.