
I have Parquet files in an S3 bucket that is not hosted on AWS S3.

Is there a tool that connects to any S3 service (like Wasabi, Digital Ocean, MinIO), and allows me to query the Parquet files?

2

Answers


  1. With MongoDB, this can be done with our Atlas Data Federation product:
    https://www.mongodb.com/docs/atlas/data-federation/overview/

    It can query parquet files stored in S3.

  2. If you need a GUI tool, you can use DBeaver with DuckDB.
    For programmatic use, DuckDB client libraries are available for most languages.

    Here is my other answer on the same topic.

    There is a slight difference, since you are querying data on S3-compatible storage.
    You simply need to run a few additional commands, mentioned here in the DuckDB docs.

    -- Execute these commands one by one, or run them as a script from DBeaver
    INSTALL httpfs;
    LOAD httpfs;
    SET s3_region='us-east-1';
    SET s3_access_key_id='****';
    SET s3_secret_access_key='****';
    
    -- Read the data using S3 protocol
    select * from read_parquet('s3://test-me-ses/userdata1.parquet');
    

    If your Parquet files are served over HTTP, from S3 or from any web server, DuckDB has this covered as well.

    -- Read the data over HTTP if the Parquet file is served by a web server
    SELECT * FROM read_parquet('https://test-me-ses.s3.amazonaws.com/userdata1.parquet');
    

    Any S3-compatible object store (Wasabi, DigitalOcean, MinIO) should work similarly.
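
    For a store that is not AWS, DuckDB usually also needs to be told where the endpoint is. Below is a minimal Python sketch of that, assuming the `duckdb` package is installed and that the legacy `SET s3_*` settings of the httpfs extension (`s3_endpoint`, `s3_url_style`, `s3_use_ssl`) are available in your DuckDB version. The helper names (`sql_literal`, `s3_settings`, `query_parquet`) and the endpoint in the usage note are hypothetical placeholders, not something from the answer above.

```python
def sql_literal(value: str) -> str:
    """Quote a Python string as a SQL string literal (single quotes doubled)."""
    return "'" + value.replace("'", "''") + "'"


def s3_settings(endpoint, region, access_key, secret_key,
                path_style=True, use_ssl=True):
    """Build the DuckDB statements that point httpfs at an S3-compatible store.

    `endpoint` is a host (and optional port) without a scheme. Path-style URLs
    are assumed by default because many self-hosted stores expect them; check
    your provider's documentation before relying on that.
    """
    return [
        "INSTALL httpfs;",
        "LOAD httpfs;",
        f"SET s3_endpoint={sql_literal(endpoint)};",
        f"SET s3_url_style={sql_literal('path' if path_style else 'vhost')};",
        f"SET s3_use_ssl={'true' if use_ssl else 'false'};",
        f"SET s3_region={sql_literal(region)};",
        f"SET s3_access_key_id={sql_literal(access_key)};",
        f"SET s3_secret_access_key={sql_literal(secret_key)};",
    ]


def query_parquet(statements, s3_url):
    """Apply the settings on a fresh in-memory connection, then read the file."""
    import duckdb  # imported lazily so the helpers above work without it

    con = duckdb.connect()
    for stmt in statements:
        con.execute(stmt)
    return con.execute(
        f"SELECT * FROM read_parquet({sql_literal(s3_url)})"
    ).fetchall()
```

    For a local MinIO instance this might be called as `query_parquet(s3_settings("localhost:9000", "us-east-1", "****", "****", use_ssl=False), "s3://test-me-ses/userdata1.parquet")`. Printing the list returned by `s3_settings` also gives a script you can paste into DBeaver.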


    After transformation, you can also write the data back as Parquet to any S3-compatible storage (AWS, MinIO, etc.).

    All of these can be done programmatically as well.
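
    As a rough illustration of the programmatic route, the sketch below writes a query result back to the object store as Parquet using DuckDB's `COPY ... TO ... (FORMAT PARQUET)` statement. It assumes the `duckdb` Python package and a connection on which the httpfs settings from the SQL above have already been applied. The function names, the target URL, and the `LIMIT 100` stand-in for a real transformation are hypothetical.

```python
def sql_literal(value: str) -> str:
    """Quote a Python string as a SQL string literal (single quotes doubled)."""
    return "'" + value.replace("'", "''") + "'"


def copy_to_parquet(select_sql: str, target_url: str) -> str:
    """Build a COPY statement that writes a query result as a Parquet file."""
    query = select_sql.rstrip().rstrip(";")  # COPY (...) must not contain a trailing ';'
    return f"COPY ({query}) TO {sql_literal(target_url)} (FORMAT PARQUET);"


def transform_and_write(con, source_url: str, target_url: str) -> None:
    """Read one Parquet file, apply a placeholder transformation, write it back.

    `con` is an open DuckDB connection that already has httpfs loaded and the
    S3 credentials set.
    """
    select_sql = f"SELECT * FROM read_parquet({sql_literal(source_url)}) LIMIT 100"
    con.execute(copy_to_parquet(select_sql, target_url))
```

    A call could look like `transform_and_write(con, "s3://test-me-ses/userdata1.parquet", "s3://test-me-ses/userdata1_sample.parquet")`, where `con` comes from `duckdb.connect()` and the second URL is a made-up output path.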
