
I use Redshift + Airflow. I need to download several reports from S3: there are 6 of them, each report is split by hour (i.e. 24 files per day per report), and this comes from 5 accounts, so every hour I download 30 files in total. I want to parallelize the download; the bottleneck is the target tables (6, one per report). So for each report and account I create a staging table, into which I COPY from S3, and then APPEND to the target table. COPY and APPEND are separated into different Airflow tasks.

The problem is that the reports have many columns, the block size in Redshift is 1 MB, and the staging table comes out at around 5 GB even though the data in it is quite sparse: I can insert 4-5 more portions of the same data into the table before its size starts to grow. It looks like this:

-- Task 1
DROP TABLE IF EXISTS schema.wrk_acc_report_h_13_dt_2023_08_07;

CREATE TABLE schema.wrk_acc_report_h_13_dt_2023_08_07 (LIKE schema.report_target_table);
                            
COPY schema.wrk_acc_report_h_13_dt_2023_08_07 (columns.....)
    FROM 's3://...../dt=2023-08-07/h=13'
    ACCESS_KEY_ID '12312323123' SECRET_ACCESS_KEY '12321313123'
    REGION '111111'
    CSV GZIP IGNOREHEADER 1 NULL 'null' TRUNCATECOLUMNS ACCEPTINVCHARS AS '?';
-- Task 2
ALTER TABLE schema.report_target_table APPEND FROM schema.wrk_acc_report_h_13_dt_2023_08_07 FILLTARGET;

After the APPEND the target tables grow to obscene sizes: each load adds about 25 GB. I could VACUUM the target tables, but that operation is quite heavy, running it in the middle of the day is not great, and it takes a long time. A plain INSERT is also slow.

Is there a way to create the intermediate tables more optimally so that they are more compact and take up less space?


Answers


  1. Rather than using COPY to load data into staging tables, you could use CREATE EXTERNAL TABLE to point directly at the data using Redshift Spectrum.

    You could then use:

    INSERT INTO target_table SELECT * FROM external_table
    

    According to the ALTER TABLE APPEND documentation, the INSERT will likely be slower, but it entirely avoids the need for a staging table within the Redshift cluster.
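
    As a minimal sketch of that path (the external schema name, IAM role ARN, Glue database, bucket path, and column list below are placeholders, not details from the question):

    -- One-time setup: an external schema backed by the AWS Glue Data Catalog
    -- (placeholder role ARN and catalog database name).
    CREATE EXTERNAL SCHEMA spectrum
    FROM DATA CATALOG
    DATABASE 'reports_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;

    -- One-time setup: an external table over the gzipped CSV files in S3
    -- (placeholder columns and location).
    CREATE EXTERNAL TABLE spectrum.acc_report (
        event_ts TIMESTAMP,
        metric_a BIGINT,
        metric_b DOUBLE PRECISION
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 's3://my-bucket/reports/acc_13/'
    TABLE PROPERTIES ('skip.header.line.count'='1');

    -- Hourly load: read straight from S3 into the target, no staging table in the cluster.
    INSERT INTO schema.report_target_table
    SELECT * FROM spectrum.acc_report;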

    Alternative approach

    In fact, if the incoming data is all in the same format (gzip csv with the same column order), then you can actually accumulate data in the same path within an S3 bucket and access it via an External Table without needing to load any of the data into the cluster. The approach would be:

    • Create an External Table (you only need to do this once unless you want separate tables for every hourly report)
    • An external process could write or move CSV files to the selected path in an S3 bucket
    • All files within that path will then be treated as being ‘part of the external table’ and no storage is used within the cluster (data is instead directly accessed from S3 when the table is queried)
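
    Since the question's files already land under a dt=.../h=... prefix, a variant of this (a sketch, with placeholder names and columns) is a partitioned external table: create it once, then register each new hour as a partition.

    -- One-time: external table over the whole report prefix, partitioned by day and hour.
    CREATE EXTERNAL TABLE spectrum.acc_report_hourly (
        event_ts TIMESTAMP,
        metric_a BIGINT
    )
    PARTITIONED BY (dt DATE, h INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 's3://my-bucket/reports/acc_13/';

    -- Per hour: register the files the external process has written to S3.
    ALTER TABLE spectrum.acc_report_hourly
    ADD IF NOT EXISTS PARTITION (dt='2023-08-07', h=13)
    LOCATION 's3://my-bucket/reports/acc_13/dt=2023-08-07/h=13/';

    -- Query directly against S3; no storage is used inside the cluster.
    SELECT COUNT(*)
    FROM spectrum.acc_report_hourly
    WHERE dt = '2023-08-07' AND h = 13;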
  2. As you correctly point out, Redshift's block size is 1 MB, and this means that each column on each slice needs at least 1 MB for the first piece of data on that slice. So if you have 1000 columns in your table on a cluster that has 16 slices, then once each slice has 1 row the table is 16 GB in size. (A few assumptions skipped.) If the table only has a few hundred rows, there is a lot of empty space in these blocks.
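
    You can see how sparse the staging and target tables are with a quick check against the system views (a sketch; the LIKE pattern is just based on the staging-table names in the question):

    -- svv_table_info reports size in 1 MB blocks, so comparing size to tbl_rows
    -- shows how much of each table is empty block space.
    SELECT "table", size AS size_mb, tbl_rows
    FROM svv_table_info
    WHERE "table" LIKE 'wrk_acc_report%'
    ORDER BY size DESC;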

    You are using ALTER TABLE APPEND to add the staging table data to the target table, which is very performant. This process just moves the blocks from the source table to the target table without processing all the data: unlink the blocks from the source table, link them to the target, done. This is very fast and requires that the source and target have the same DDL.

    It sounds like what is happening is that your staging tables are not filling up the blocks, or even coming close to doing so. These very sparse blocks are then appended to the target table repeatedly, and this makes your target table large.

    There are several ways to look at this situation depending on your point of view:

    1. ALTER TABLE APPEND is not appropriate for this use case, as the source blocks are not properly "structured" and this creates downstream issues.
    2. The "wasted space" on the target table needs to be optimized out after each append to keep the size in check.

    These lead to likely solutions to the issue:

    1. INSERT the data into the target table rather than ALTER TABLE APPEND (see the sketch after this list). The data size is small enough that the performance lost with an INSERT will be more than made up for by a compact table size and less frequent VACUUMs.
    2. Perform a DELETE ONLY vacuum after each staging table is appended. A DELETE ONLY vacuum is much faster than a full vacuum as it doesn't do any row sorting, just removing dead space in the table.
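
    A sketch of both options, reusing the table names from the question (column lists omitted, as in the original COPY):

    -- Option 1: plain INSERT instead of ALTER TABLE APPEND.
    -- Rows are written into densely packed blocks rather than moving the
    -- staging table's sparse blocks, so the target stays compact.
    INSERT INTO schema.report_target_table
    SELECT * FROM schema.wrk_acc_report_h_13_dt_2023_08_07;

    DROP TABLE schema.wrk_acc_report_h_13_dt_2023_08_07;

    -- Option 2: keep ALTER TABLE APPEND, then reclaim space without re-sorting.
    ALTER TABLE schema.report_target_table
        APPEND FROM schema.wrk_acc_report_h_13_dt_2023_08_07 FILLTARGET;

    VACUUM DELETE ONLY schema.report_target_table;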

    I suspect that #1 is the better answer in your case but this is based on assumptions I’m making when reading your question.

    As to having optimal intermediate tables this would require having more data in the source tables before the append operation. (More appropriate source tables for the ALTER TABLE APPEND.) I don’t know your situation well enough to know if you can make source tables that have mostly "full" blocks within the constraints of your solution. Short of this I don’t think there is a way to "optimize" the source tables.

    Redshift is optimized for large, packed tables, and issues like this arise when working with many small tables (even if they are built up over time, the "smallness" is carried forward). Large and small tables in this context are a function of the cluster size (# of slices). With fewer slices the blocks "fill up" faster. This is one of the reasons that DISTSTYLE ALL can be an advantage (fewer slices), though far from the most important one.
