
The business logic is as follows:

  1. The user uploads a CSV file;
  2. The application converts the CSV file into a database table;
  3. Later, the user can run SQL against the table to generate a BI report.

Currently, the solution is to save each table to MySQL. But as time goes on, the MySQL database accumulates thousands of tables.

I want to find a file format that represents a table, can be put in an object storage such as AWS S3, and can then be queried with SQL directly.

For example:

Datasource ds = new Datasource("s3://xxx/bbb/t1.tbl");

ResultSet rs = ds.runSQL("select c1, c2 from t1 where c3=8");

What are your ideas or solutions?

2 Answers


  1. Amazon S3 can run an SQL query against a single CSV file. It uses a capability called S3 Select.

    From Filtering and retrieving data using Amazon S3 Select – Amazon Simple Storage Service:

    With Amazon S3 Select, you can use simple structured query language (SQL) statements to filter the contents of an Amazon S3 object and retrieve just the subset of data that you need. By using Amazon S3 Select to filter this data, you can reduce the amount of data that Amazon S3 transfers, which reduces the cost and latency to retrieve this data.

    You can make an API call to S3 to perform the SQL query and retrieve the results. No database required. Just pay for the storage used by the CSV files (which can be gzipped to save space), plus $0.002 per GB scanned and $0.0007 per GB returned.

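As a rough sketch of how this maps onto the asker's desired `Datasource`/`runSQL` API: the snippet below builds the parameters for boto3's real `select_object_content` call. Note that S3 Select SQL references the object as `s3object` rather than the table name, and that CSV columns come back as strings. The bucket, key, and column names are made up for illustration.

```python
from urllib.parse import urlparse

def parse_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError("expected an s3:// URI")
    return parsed.netloc, parsed.path.lstrip("/")

def build_select_request(uri, expression):
    """Build the keyword arguments for boto3's s3.select_object_content()."""
    bucket, key = parse_s3_uri(uri)
    return {
        "Bucket": bucket,
        "Key": key,
        "Expression": expression,       # e.g. "SELECT s.c1, s.c2 FROM s3object s WHERE s.c3 = '8'"
        "ExpressionType": "SQL",
        # CSV with a header row; add "CompressionType": "GZIP" for gzipped files
        "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
        "OutputSerialization": {"CSV": {}},
    }

def run_select(uri, expression):
    """Execute the query via S3 Select (needs AWS credentials configured)."""
    import boto3  # AWS SDK for Python
    s3 = boto3.client("s3")
    response = s3.select_object_content(**build_select_request(uri, expression))
    # The response is an event stream; "Records" events carry result bytes.
    rows = []
    for event in response["Payload"]:
        if "Records" in event:
            rows.append(event["Records"]["Payload"].decode("utf-8"))
    return "".join(rows)
```

With `FileHeaderInfo: "USE"`, columns can be referenced by their header names; CSV values are untyped, so numeric comparisons need a cast such as `CAST(s.c3 AS INT) = 8`.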
  2. You can store the file as CSV in S3 and use S3 Select, as mentioned in the other answer. Or you can store it as CSV or Parquet (a much more performant columnar format) and run queries against it using Amazon Athena.

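For the Athena route, each uploaded CSV is registered once as an external table pointing at its S3 prefix, after which arbitrary SQL runs against it. A minimal sketch, assuming boto3 and made-up bucket, table, and column names; `start_query_execution` is Athena's real submission API, and results are written to an S3 output location rather than returned inline:

```python
def csv_table_ddl(table, s3_prefix, columns):
    """Build the Athena DDL that maps a CSV prefix in S3 to a queryable table.

    `columns` is a list of (name, athena_type) pairs, e.g. [("c1", "string")].
    Uses the simple delimited SerDe; quoted CSV fields would need OpenCSVSerde.
    """
    cols = ", ".join(f"{name} {typ}" for name, typ in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({cols}) "
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
        f"LOCATION '{s3_prefix}' "
        "TBLPROPERTIES ('skip.header.line.count'='1')"
    )

def run_athena_query(sql, output_location, database="default"):
    """Submit a query to Athena (needs AWS credentials); returns the execution id.

    Athena writes result files under `output_location`, an s3:// prefix.
    """
    import boto3
    athena = boto3.client("athena")
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return started["QueryExecutionId"]
```

One design note: because the table definition is just metadata, thousands of tables cost nothing in Athena itself; you pay per query by data scanned, which is why converting hot tables to Parquet pays off.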