I am loading CSV files into Redshift. I first COPY the files into a staging table, and then normalize the account data into a separate lookup table.
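For context, the load into the staging table is a plain COPY, roughly like this (the bucket path, file name, and IAM role below are placeholders, not my real values):

COPY Staging
FROM 's3://my-bucket/usage.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
CSV
IGNOREHEADER 1;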
BEGIN;
INSERT INTO AccountLookup (Account)
SELECT DISTINCT db.Account
FROM Staging db
WHERE NOT EXISTS (
    SELECT 1 FROM AccountLookup acct WHERE acct.Account = db.Account
);
END;
I expect the above query not to insert any duplicate rows into the AccountLookup table. It works most of the time, but occasionally I find duplicate entries, which causes problems.
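A simple aggregate like the one below shows the duplicated accounts (anything with a count greater than one):

SELECT Account, COUNT(*) AS cnt
FROM AccountLookup
GROUP BY Account
HAVING COUNT(*) > 1;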
Next, I use the following query to transfer the data from the staging table to the main table. Here I join on the lookup table to get the account id. Because the lookup table has duplicate entries, the main table also ends up with duplicate entries, which it shouldn't have.
BEGIN;
INSERT INTO Main (Usage, AccountId, Cost)
SELECT bd.Usage, acct.Id, bd.Cost
FROM Staging bd
LEFT JOIN AccountLookup acct ON bd.Account = acct.Account;
END;
I can eliminate duplicates in the Main table with the row_number() window function, but I would rather avoid inserting duplicate entries into the lookup table in the first place. Any help would be appreciated.
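For completeness, this is roughly the row_number() clean-up I mean; it assumes a duplicate is a row where the whole (Usage, AccountId, Cost) combination already appears, which may not fit every case:

-- Keep one row per (Usage, AccountId, Cost) combination.
CREATE TEMP TABLE Main_dedup AS
SELECT Usage, AccountId, Cost
FROM (
    SELECT Usage, AccountId, Cost,
           ROW_NUMBER() OVER (PARTITION BY Usage, AccountId, Cost ORDER BY AccountId) AS rn
    FROM Main
) ranked
WHERE rn = 1;

TRUNCATE Main;
INSERT INTO Main SELECT Usage, AccountId, Cost FROM Main_dedup;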
2 Answers
Try this: you should only INSERT DISTINCT rows that are not already in the main table, which you can ensure with a NOT EXISTS check. You should also do that for the first query.
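A sketch of what that could look like for the second insert, assuming a "duplicate" means the same (Usage, AccountId, Cost) combination already present in Main:

BEGIN;
INSERT INTO Main (Usage, AccountId, Cost)
SELECT DISTINCT bd.Usage, acct.Id, bd.Cost
FROM Staging bd
LEFT JOIN AccountLookup acct ON bd.Account = acct.Account
WHERE NOT EXISTS (
    SELECT 1
    FROM Main m
    WHERE m.Usage = bd.Usage
      AND m.AccountId = acct.Id
      AND m.Cost = bd.Cost
);
END;

Worth noting: Redshift does not enforce primary-key or unique constraints, so if two loads run concurrently the NOT EXISTS check alone cannot guarantee uniqueness.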