I am loading CSV files into Redshift. I first COPY the files into a staging table, and then normalize the account data into a separate lookup table.
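For context, the load into the staging table is a plain COPY, roughly like this (the bucket path, file name, and IAM role below are placeholders, not my real values):

COPY Staging
FROM 's3://my-bucket/usage.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
CSV
IGNOREHEADER 1;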
BEGIN;
INSERT INTO AccountLookup (Account)
SELECT DISTINCT db.Account
FROM Staging db
WHERE NOT EXISTS (
    SELECT 1 FROM AccountLookup acct WHERE acct.Account = db.Account
);
END;
I expect the above query not to insert any duplicate rows into the AccountLookup table. It works most of the time, but occasionally I find duplicate entries, which causes problems.
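A simple aggregate like the one below shows the duplicated accounts (anything with a count greater than one):

SELECT Account, COUNT(*) AS cnt
FROM AccountLookup
GROUP BY Account
HAVING COUNT(*) > 1;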
Next, I use the following query to transfer the data from the staging table to the main table. Here I join on the lookup table to get the account id. Because the lookup table has duplicate entries, the main table also ends up with duplicate entries, which it shouldn't have.
BEGIN;
INSERT INTO Main (Usage, AccountId, Cost)
SELECT bd.Usage, acct.Id, bd.Cost
FROM Staging bd
LEFT JOIN AccountLookup acct ON bd.Account = acct.Account;
END;
I can eliminate duplicates in the Main table with the row_number() window function, but I would rather avoid inserting duplicate entries into the lookup table in the first place. Any help would be appreciated.
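For completeness, this is roughly the row_number() clean-up I mean; it assumes a duplicate is a row where the whole (Usage, AccountId, Cost) combination already appears, which may not fit every case:

-- Keep one row per (Usage, AccountId, Cost) combination.
CREATE TEMP TABLE Main_dedup AS
SELECT Usage, AccountId, Cost
FROM (
    SELECT Usage, AccountId, Cost,
           ROW_NUMBER() OVER (PARTITION BY Usage, AccountId, Cost ORDER BY AccountId) AS rn
    FROM Main
) ranked
WHERE rn = 1;

TRUNCATE Main;
INSERT INTO Main SELECT Usage, AccountId, Cost FROM Main_dedup;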
2 Answers
Try this: you should only INSERT DISTINCT rows that are not already in the main table, which you can ensure with a NOT EXISTS check. You should also do that for the first query.
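A sketch of what that could look like for the second insert, assuming a "duplicate" means the same (Usage, AccountId, Cost) combination already present in Main:

BEGIN;
INSERT INTO Main (Usage, AccountId, Cost)
SELECT DISTINCT bd.Usage, acct.Id, bd.Cost
FROM Staging bd
LEFT JOIN AccountLookup acct ON bd.Account = acct.Account
WHERE NOT EXISTS (
    SELECT 1
    FROM Main m
    WHERE m.Usage = bd.Usage
      AND m.AccountId = acct.Id
      AND m.Cost = bd.Cost
);
END;

Worth noting: Redshift does not enforce primary-key or unique constraints, so if two loads run concurrently the NOT EXISTS check alone cannot guarantee uniqueness.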