I use Redshift + Airflow. I need to download various reports from S3: there are 6 of them, each report is split by hour (i.e. 24 files per day per report), and the same reports come from 5 more accounts. In total I download 30 files every hour. I want to parallelize the loading; the bottleneck is the target tables (6, one per report). So for each report and account I create a staging table, into which I COPY from S3, and then APPEND to the target table. The COPY and the APPEND are separate Airflow tasks.
The problem is that the reports have many columns, the block size in Redshift is 1 MB, and the staging table ends up at 5 GB even though the data is stored rather sparsely: I can insert 4-5 more portions of the same data into this table before the table size starts to grow. It looks like this:
-- Task 1
DROP TABLE IF EXISTS schema.wrk_acc_report_h_13_dt_2023_08_07;
CREATE TABLE schema.wrk_acc_report_h_13_dt_2023_08_07 (LIKE schema.report_target_table);
COPY schema.wrk_acc_report_h_13_dt_2023_08_07 (columns.....)
FROM 's3://...../dt=2023-08-07/h=13'
ACCESS_KEY_ID '12312323123' SECRET_ACCESS_KEY '12321313123'
REGION '111111'
CSV GZIP IGNOREHEADER 1 NULL 'null' TRUNCATECOLUMNS ACCEPTINVCHARS AS '?';
-- Task 2
ALTER TABLE schema.report_target_table APPEND FROM schema.wrk_acc_report_h_13_dt_2023_08_07 FILLTARGET;
After the APPEND, the target tables grow to obscene sizes, about +25 GB per load. I could run VACUUM on the target tables, but that operation is quite heavy, running it in the middle of the day is not great, and it takes a long time. Using INSERT instead of APPEND is also slow.
Is it possible to create the staging tables in some more optimal way so that they are more compact and take up less space?
2 Answers
Rather than using COPY to load data into staging tables, you could use CREATE EXTERNAL TABLE to point directly at the data using Redshift Spectrum. You could then INSERT from the external table straight into the target table. According to the ALTER TABLE APPEND documentation, the INSERT will likely be slower, but it completely avoids the need for a staging table inside the Redshift cluster.
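A minimal sketch of what that could look like; the external schema name, IAM role, and column list are placeholders of mine, not from the answer:
-- Hypothetical names throughout; adjust schema, role, columns and S3 path to your setup.
CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_reports
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- External table pointing at the gzipped CSV files for one report/account/hour.
CREATE EXTERNAL TABLE spectrum_reports.acc_report_h_13_dt_2023_08_07 (
    col1 VARCHAR(256),
    col2 BIGINT
    -- ... the same columns as schema.report_target_table
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://...../dt=2023-08-07/h=13/'
TABLE PROPERTIES ('skip.header.line.count'='1');

-- Load directly into the target; no staging table is created in the cluster.
INSERT INTO schema.report_target_table (col1, col2)
SELECT col1, col2
FROM spectrum_reports.acc_report_h_13_dt_2023_08_07;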
Alternative approach
In fact, if the incoming data is all in the same format (gzipped CSV with the same column order), you can accumulate the data under a common path in an S3 bucket and access it via an external table without ever loading it into the cluster; a sketch of the approach follows.
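One possible shape of that, assuming hive-style dt=/h= partitioning as in the question's S3 paths (the schema and table names are mine):
-- Hypothetical: one partitioned external table over the whole report prefix.
CREATE EXTERNAL TABLE spectrum_reports.acc_report (
    col1 VARCHAR(256),
    col2 BIGINT
    -- ... the report's columns, excluding the partition columns
)
PARTITIONED BY (dt DATE, h INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://...../'
TABLE PROPERTIES ('skip.header.line.count'='1');

-- Register each new hour as its files land in S3 (e.g. from an hourly Airflow task).
ALTER TABLE spectrum_reports.acc_report
ADD IF NOT EXISTS PARTITION (dt='2023-08-07', h='13')
LOCATION 's3://...../dt=2023-08-07/h=13/';

-- Query the data in place: no COPY, no staging table, no APPEND.
SELECT dt, h, COUNT(*) AS row_count
FROM spectrum_reports.acc_report
WHERE dt = '2023-08-07'
GROUP BY dt, h;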
As you correctly point out, Redshift's block size is 1 MB, and this means that each column on each slice needs at least 1 MB for the first piece of data on that slice. So if you have 1,000 columns in your table on a cluster that has 16 slices, then once each slice has one row the table is 16 GB in size. (A few assumptions skipped.) If the table only has a few hundred rows, there is a lot of empty space in these blocks.
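If you want to see this effect directly, one way (my own sketch, not part of the answer) is to count the 1 MB blocks per column for one of the staging tables using the system tables:
-- Each row in STV_BLOCKLIST is one 1 MB block; comparing block counts with the
-- number of values stored shows how empty the blocks of a sparse table are.
SELECT b.col, COUNT(*) AS blocks_1mb, SUM(b.num_values) AS values_stored
FROM stv_blocklist b
JOIN svv_table_info t ON b.tbl = t.table_id
WHERE t."table" = 'wrk_acc_report_h_13_dt_2023_08_07'
GROUP BY b.col
ORDER BY b.col;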
You are using ALTER TABLE APPEND to add the staging table data to the target table, which is very performant. This process just moves the blocks from the source table to the target table without reading through all the data: each block is simply unlinked from the source table and linked to the target, done. This is very fast, and it requires that the source and target have the same DDL.
It sounds like your staging tables are not filling up their blocks, or even coming close to it. These very sparse blocks are then repeatedly appended to the target table, and this is what makes the target table so large.
There are several ways to look at this situation depending on your point of view:
These lead to likely solutions to the issue:
I suspect that #1 is the better answer in your case but this is based on assumptions I’m making when reading your question.
As for having optimal intermediate tables, this would require having more data in the source tables before the append operation (more appropriate source tables for the ALTER TABLE APPEND). I don't know your situation well enough to say whether you can make source tables that have mostly "full" blocks within the constraints of your solution. Short of this, I don't think there is a way to "optimize" the source tables.
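If your constraints allow it, a minimal sketch of that idea would be to run several COPYs into one combined staging table and do a single APPEND, so the blocks fill up before they are moved; the per-day staging table name and credential placeholders here are mine:
-- Hypothetical: one staging table per report shared by several hourly/account loads.
DROP TABLE IF EXISTS schema.wrk_acc_report_dt_2023_08_07;
CREATE TABLE schema.wrk_acc_report_dt_2023_08_07 (LIKE schema.report_target_table);

-- One COPY per hour/account into the same staging table.
COPY schema.wrk_acc_report_dt_2023_08_07 (columns.....)
FROM 's3://...../dt=2023-08-07/h=13'
ACCESS_KEY_ID '...' SECRET_ACCESS_KEY '...'
REGION '...'
CSV GZIP IGNOREHEADER 1 NULL 'null' TRUNCATECOLUMNS ACCEPTINVCHARS AS '?';
-- ... further COPY commands for other hours/accounts ...

-- A single APPEND then moves much fuller blocks into the target.
ALTER TABLE schema.report_target_table APPEND FROM schema.wrk_acc_report_dt_2023_08_07 FILLTARGET;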
Redshift is optimized for large, packed tables, and issues like this arise when working with many small tables (even if these are built up over time, the "smallness" is carried forward). Large and small in this context are a function of the cluster size (number of slices): with fewer slices the blocks "fill up" faster. This is one of the reasons that DISTSTYLE ALL can be an advantage (fewer slices), though far from the most important one.