My organization stores long term data in multiple S3 buckets. Given that the data format is consistent across the different regions, I would like to create one table that I can query with Athena, rather than specifying the particular bucket.
Initially (base on previous questions here) I thought the way to do this was to create two tables and then a primary table, for example:
*britishUsers. -> s3 bucket in UK
*americanUsers -> s3 bucket in US
then create a primary table using the SQL
CREATE TABLE users
AS SELECT * britishUsers UNION SELECT * americanUsers;
My problem is that I believe these types of tables (CTAS) replicate the data in another S3 location which is problematic from a privacy perspective. I’m also not sure if the table will update as more data is added to the secondary tables. Is a view the best way forward here?
2
Answers
Create view as a select will indeed avoid the copy. Data will remain fresh as well.
Privacy concerns could bring additional requirements (other than data locality) like anonymization and consent management. IMHO this leads to the creation of a specific table and a copy of only the data you can use. You will have to update regularly this table to include recent updates. You will need to capture the last updates in your records to track changes.
I agree with @MarcC’s response. I’m using a similar pattern to join historic vs. current data with a union all. Not the most performant option but using the views give additional benefits to format or convert data as necessary.