
I'm trying to play around with these platforms and build some sort of data lake/warehouse.
I have a MinIO instance on localhost:9000, and I can use the UI and upload files. Great.

I created a network named spark_network and started MinIO:

docker network create spark_network
docker volume create s3_drive
docker run --network spark_network --network-alias iceberg-lake.minio -p 9000:9000 -p 9001:9001 -v s3_drive:/data -e MINIO_ROOT_USER=username -e MINIO_ROOT_PASSWORD=password -d --name minio quay.io/minio/minio server /data --console-address ":9001"

I'm building an image on top of quay.io/jupyter/all-spark-notebook with this Dockerfile:

FROM quay.io/jupyter/all-spark-notebook

# Set the working directory to /code
WORKDIR /code

# Copy PostgreSQL driver
COPY postgresql-42.7.1.jar /opt/jars/postgresql-42.7.1.jar

# Expose Spark UI and JupyterLab ports
EXPOSE 4040 8888

# Start JupyterLab on container start
CMD ["start-notebook.sh", "--NotebookApp.token=''"]

Then I built it (tagged spark-jupyterlab) and ran:

docker volume create jupyter_code
docker run -p 8888:8888 -p 4040:4040 -e AWS_REGION=us-east-1 -e MINIO_REGION=us-east-1 -e AWS_S3_ENDPOINT=http://minio:9000 -e AWS_ACCESS_KEY_ID=key_id -e AWS_SECRET_ACCESS_KEY=secret_key --network spark_network -v jupyter_code:/code -d --name spark spark-jupyterlab

So everything is up, and I also have a Postgres container (postgres-test) on the same network.

When I try to read from the MinIO server with Spark (in JupyterLab), I create the session like this:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg_jdbc")
    .config("spark.jars", "/opt/jars/postgresql-42.7.1.jar")
    .config('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.4,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.2,org.apache.iceberg:iceberg-aws-bundle:1.4.2')
    .config('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')
    .config('spark.sql.catalog.cves', 'org.apache.iceberg.spark.SparkCatalog')
    .config('spark.sql.catalog.cves.io-impl', 'org.apache.iceberg.aws.s3.S3FileIO')
    .config('spark.sql.catalog.cves.warehouse', 's3a://iceberg-lake/lake/')
    .config('spark.sql.catalog.cves.s3.endpoint', 'http://minio:9000')
    .config('spark.sql.catalog.cves.s3.path-style-access', 'true')
    .config('spark.sql.catalog.cves.catalog-impl', 'org.apache.iceberg.jdbc.JdbcCatalog')
    .config('spark.sql.catalog.cves.uri', 'jdbc:postgresql://postgres-test:5432/metastore')
    .config('spark.sql.catalog.cves.jdbc.verifyServerCertificate', 'false')
    .config('spark.sql.catalog.cves.jdbc.useSSL', 'false')
    .config('spark.sql.catalog.cves.jdbc.user', '<pg user>')
    .config('spark.sql.catalog.cves.jdbc.password', '<pg password>')
    .config("spark.hadoop.fs.s3a.access.key", "<access-key>")
    .config("spark.hadoop.fs.s3a.secret.key", "<secret-key>")
    .getOrCreate()
)

After all that, I'm trying to read one JSON file from the MinIO bucket called "iceberg-lake".

I tried it without changing or adding any policy or user (but of course with access keys), and also with a dedicated user, policy, and so on.
The end result is always the same. When I try to read like this:

df = spark.read.json("s3a://iceberg-lake/lake/1999/0xxx/CVE-1999-0001.json")

I'm getting this:

Py4JJavaError: An error occurred while calling o60.json.
: java.nio.file.AccessDeniedException: s3a://iceberg-lake/lake/1999/0xxx/CVE-1999-0001.json: getFileStatus on s3a://iceberg-lake/lake/1999/0xxx/CVE-1999-0001.json: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: AGJWYZYKWD26CZXH; S3 Extended Request ID: 0haOOhMeMtsgJN5OMwmehLBWVFkDqZmy8WHvA+Tiym4IBNR0x88kIPEH3ddonlVPS6/FXcOyfrI=; Proxy: null), S3 Extended Request ID: 0haOOhMeMtsgJN5OMwmehLBWVFkDqZmy8WHvA+Tiym4IBNR0x88kIPEH3ddonlVPS6/FXcOyfrI=:403 Forbidden
    at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:255)
    at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:175)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3796)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3688)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$exists$34(S3AFileSystem.java:4703)
    at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:499)
    at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:444)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2337)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2356)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.exists(S3AFileSystem.java:4701)
    at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4(DataSource.scala:756)
    at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4$adapted(DataSource.scala:754)
    at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:380)
    at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
    at scala.util.Success.$anonfun$map$1(Try.scala:255)
    at scala.util.Success.map(Try.scala:213)
    at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
    at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
    at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
    at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1395)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
    at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
    at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
    at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)

I tried different policies and different keys, with and without users. I deleted all the images and built them from scratch again, and tried fresh containers as well.
Still, I can't read a single JSON file (and in the end I want to read the whole folder, of course).
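
For reference, this is roughly what I ultimately want to run (just a sketch, assuming the same bucket layout as the single-file example above, and the spark session defined earlier):

# Eventually I want to read whole directories rather than one file.
df_dir = spark.read.json("s3a://iceberg-lake/lake/1999/0xxx/")        # one directory
df_all = (spark.read
          .option("recursiveFileLookup", "true")                      # descend into nested year/prefix folders
          .json("s3a://iceberg-lake/lake/"))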

Can anybody help?

2 Answers


  1. The fact that the error comes back with an S3 extended request ID implies the response is coming from AWS, not MinIO.

    Why don't you set "spark.hadoop.fs.s3a.endpoint" to your MinIO endpoint, so that the S3A connector itself picks up the value?
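
    For example, a minimal sketch (untested; the endpoint and placeholder keys are the ones from your question, and the path-style/SSL settings are assumptions that local MinIO setups typically need):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("iceberg_jdbc")
        # Give the S3A connector itself the MinIO endpoint, not only the Iceberg catalog
        .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
        # MinIO usually needs path-style access and, for a local setup, plain HTTP
        .config("spark.hadoop.fs.s3a.path.style.access", "true")
        .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
        .config("spark.hadoop.fs.s3a.access.key", "<access-key>")
        .config("spark.hadoop.fs.s3a.secret.key", "<secret-key>")
        .getOrCreate()
    )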

  2. I have a Docker Compose file for this kind of evaluation setup:

    version: "3.9"
    
    services:
      dremio:
        platform: linux/x86_64
        image: dremio/dremio-oss:latest
        ports:
          - 9047:9047
          - 31010:31010
          - 32010:32010
        container_name: dremio
    
      minioserver:
        image: minio/minio
        ports:
          - 9000:9000
          - 9001:9001
        environment:
          MINIO_ROOT_USER: minioadmin
          MINIO_ROOT_PASSWORD: minioadmin
        container_name: minio
        command: server /data --console-address ":9001"
    
      spark_notebook:
        image: alexmerced/spark33-notebook
        ports: 
          - 8888:8888
        env_file: .env
        container_name: notebook
    
      nessie:
        image: projectnessie/nessie
        container_name: nessie
        ports:
          - "19120:19120"
    
    networks:
      default:
        name: iceberg_env
        driver: bridge 
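
    With this stack up, the notebook can create a Spark session against Nessie and MinIO. A rough sketch (untested): the nessie/minio hostnames and minioadmin credentials come from the compose file above, while the catalog name, warehouse bucket, and package versions are my assumptions, and the .env file is assumed to carry AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_REGION for S3FileIO:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("iceberg_nessie")
        # Iceberg Spark runtime, Nessie extensions, and AWS SDK (versions are an assumption)
        .config("spark.jars.packages",
                "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.1,"
                "org.projectnessie.nessie-integrations:nessie-spark-extensions-3.3_2.12:0.67.0,"
                "software.amazon.awssdk:bundle:2.17.178,"
                "software.amazon.awssdk:url-connection-client:2.17.178")
        .config("spark.sql.extensions",
                "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
                "org.projectnessie.spark.extensions.NessieSparkSessionExtensions")
        # An Iceberg catalog named "nessie", backed by the Nessie service from the compose file
        .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
        .config("spark.sql.catalog.nessie.uri", "http://nessie:19120/api/v1")
        .config("spark.sql.catalog.nessie.ref", "main")
        .config("spark.sql.catalog.nessie.authentication.type", "NONE")
        .config("spark.sql.catalog.nessie.warehouse", "s3a://warehouse/")   # hypothetical bucket, create it in MinIO first
        .config("spark.sql.catalog.nessie.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
        .config("spark.sql.catalog.nessie.s3.endpoint", "http://minio:9000")
        .config("spark.sql.catalog.nessie.s3.path-style-access", "true")
        .getOrCreate()
    )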
    

    For direction on the whole setup, I have it documented in this blog.
