
I am pretty baffled and I don’t know what is going on with this one.

I’m using DuckDB to query Parquet files in an S3 bucket.

import pandas as pd  # needed for .df()
import duckdb

query = """
    INSTALL httpfs;
    LOAD httpfs;
    SET s3_region='us-west-2';
    SET s3_access_key_id='key';
    SET s3_secret_access_key='secret';
    SELECT *
    FROM read_parquet('s3://bucket/folder/file.parquet')
"""

cursor = duckdb.connect()

cursor.execute(query).df()

I have an IAM user with admin access, and I am able to query this Parquet file with that user's programmatic access keys. I also have a role that I want to use in an application, which I have also given admin access just for testing purposes.

When I assume the role, create temporary credentials, and plug those into the code above:

export $(printf "AWS_ACCESS_KEY_ID=%s AWS_SECRET_ACCESS_KEY=%s AWS_SESSION_TOKEN=%s" \
$(aws sts assume-role \
--role-arn arn:aws:iam::<account-id>:role/<role-name> \
--role-session-name test-session \
--query "Credentials.[AccessKeyId,SecretAccessKey,SessionToken]" \
--output text))

I get the error:

duckdb.Error: Invalid Error: Unable to connect to URL
"s3://bucket/folder/file.parquet": 403 (Forbidden)

However, when I use my IAM user, I am able to access this S3 object and query the data just fine. Is there something I am missing about the difference between roles and IAM users?

If it helps, what I am trying to do is create a role for a Lambda function and then read the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY with os.getenv() in the code above. I believe that if I can get the role working by hard-coding the temporary credentials, it should also work when I read them from the environment in the Lambda function.
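
Roughly what I have in mind for the Lambda handler (the bucket and key are placeholders):

import os
import duckdb

def handler(event, context):
    # Lambda injects the execution role's temporary credentials
    # into the environment
    access_key = os.getenv("AWS_ACCESS_KEY_ID")
    secret_key = os.getenv("AWS_SECRET_ACCESS_KEY")

    query = f"""
        INSTALL httpfs;
        LOAD httpfs;
        SET s3_region='us-west-2';
        SET s3_access_key_id='{access_key}';
        SET s3_secret_access_key='{secret_key}';
        SELECT *
        FROM read_parquet('s3://bucket/folder/file.parquet')
    """
    return duckdb.connect().execute(query).df()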

2 Answers


  1. I had a very similar issue; after also setting the session token via SET s3_session_token='session-token'; it worked. Also, be aware that S3 is not a global service, which means you need to make sure you set the correct s3_region.

    The code would be changed to

    import pandas as pd
    import duckdb

    query = """
        INSTALL httpfs;
        LOAD httpfs;
        SET s3_region='us-west-2';
        SET s3_access_key_id='key';
        SET s3_secret_access_key='secret';
        SET s3_session_token='session-token';
        SELECT *
        FROM read_parquet('s3://bucket/folder/file.parquet')
    """

    cursor = duckdb.connect()

    cursor.execute(query).df()
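
    Since you mention a Lambda role: the Lambda runtime exposes AWS_SESSION_TOKEN as an environment variable alongside the key pair, so instead of hard-coding values you can build the query from the environment. A minimal sketch:

    import os
    import duckdb

    # All three variables are set by the Lambda runtime for the execution role
    query = f"""
        INSTALL httpfs;
        LOAD httpfs;
        SET s3_region='us-west-2';
        SET s3_access_key_id='{os.getenv("AWS_ACCESS_KEY_ID")}';
        SET s3_secret_access_key='{os.getenv("AWS_SECRET_ACCESS_KEY")}';
        SET s3_session_token='{os.getenv("AWS_SESSION_TOKEN")}';
        SELECT *
        FROM read_parquet('s3://bucket/folder/file.parquet')
    """

    df = duckdb.connect().execute(query).df()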
    
  2. If the other answer doesn’t help you, you might want to try setting a different s3_region.

    I got a similar error message (Error: Invalid Error: IO Error: Unable to connect to URL "s3://elsa-data-lake/transformed/20230228_041047_00038_yihfd_00629123-f824-4a31-ba70-e341d4028a3b.parquet": 400 (Bad Request)) but with a different underlying problem: I had set the wrong S3 region, thinking that S3 is a global service. I was using SET s3_region='us-east-1'; because that is where our SSO service is set up, but I needed to specify the region where the file is actually stored. With SET s3_region='eu-west-1'; everything works!

    (Screenshot from the S3 console: the file was stored in "EU (Ireland) eu-west-1".)
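
    If you can’t easily check the console, you can also ask S3 for the bucket’s region directly; a quick sketch with boto3, using the bucket from the error above:

    import boto3

    # get_bucket_location returns None as LocationConstraint for us-east-1
    location = boto3.client("s3").get_bucket_location(Bucket="elsa-data-lake")
    print(location["LocationConstraint"] or "us-east-1")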
