I’m using Python with redshift_connector and analysing the data with pandas. When querying a Redshift DB and selecting n columns, I get back i rows. However, when I added one more column to the query, it timed out after an hour. To work around this, I selected the n+1 columns and used LIMIT and OFFSET in a loop to fetch the rows in batches. This eventually returned i rows, but something did not add up: when I compared the results, the batched version contained a number of duplicate rows. How can I write the query so that it neither times out nor returns duplicates?

Original mock query that won’t time out:

SELECT a, b, c 
FROM table
WHERE a IN ('attribute1','attribute2')

Timeout:

SELECT a, b, c, d
FROM table
WHERE a IN ('attribute1','attribute2')

If I put the second query in a while True loop, amend it with LIMIT and OFFSET, use pd.read_sql(query, connection) to fetch each batch, append the batches to a list, and concat the list at the end, I get back exactly the same number of rows as the first query, but with duplicates.
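A minimal sketch of the loop described above, using an in-memory sqlite3 database as a stand-in for the Redshift connection (both follow the DB-API interface that pd.read_sql accepts); the table, column names, and chunk size are illustrative. The ORDER BY on a unique key is what keeps the pages from overlapping:

```python
import sqlite3
import pandas as pd

def paginated_read(connection, base_query, chunk_size=2):
    """Fetch all rows of base_query in LIMIT/OFFSET pages.

    base_query should end with an ORDER BY over a unique key; without it,
    Redshift may return rows in a different order on each page, so the
    concatenated result can contain duplicates and miss rows.
    """
    frames = []
    offset = 0
    while True:
        page = pd.read_sql(
            f"{base_query} LIMIT {chunk_size} OFFSET {offset}", connection
        )
        if page.empty:
            break
        frames.append(page)
        offset += chunk_size
    return pd.concat(frames, ignore_index=True)

# Tiny demo table standing in for the real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a TEXT, b INTEGER, c INTEGER, d INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [("attribute1", i, i * 2, i * 3) for i in range(5)],
)

df = paginated_read(
    conn,
    "SELECT a, b, c, d FROM t WHERE a IN ('attribute1','attribute2') "
    "ORDER BY b",  # b is unique in this demo, so page boundaries are stable
)
print(len(df), df.duplicated().sum())
```

In practice the chunk size would be in the tens of thousands, and the unique ordering key would be whatever key the real table has.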

Answers


  1. Without an ORDER BY clause this is exactly what I would expect. Redshift is a clustered database, and the response time of the various nodes determines the order of the returned rows. Since that timing depends on many uncontrolled variables, the row order will differ each time you run the query.

    So add an ORDER BY clause and make sure it orders all rows deterministically, with no arbitrary ordering possible: no two rows may share the same order position given the columns you choose. Finding such a key may take some time, and be aware that the extra work of sorting the rows may put you back into a timeout situation.

    Basically, I don’t think LIMIT/OFFSET paging is the best way to approach this issue. It has no guarantee of returning coherent data, because the table can change between the iterative queries.

    It is also worth understanding why the query times out just from adding a column. Does the working query already take nearly an hour? What is taking so long: query execution or data transfer? And which timeout are you actually hitting?

    Depending on what is causing the timeouts, the better way to iterate over a large result set is with a cursor: no ordering is required and there are no coherency concerns.
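A sketch of the cursor-based iteration suggested above: execute the query once and pull rows down in fixed-size batches with fetchmany(), so no ORDER BY and no repeated queries are needed. sqlite3 stands in for redshift_connector here, since both follow the DB-API cursor protocol; the table and batch size are illustrative.

```python
import sqlite3
import pandas as pd

# Demo table standing in for the real Redshift table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a TEXT, b INTEGER, c INTEGER, d INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [("attribute1", i, i * 2, i * 3) for i in range(10)],
)

cur = conn.cursor()
cur.execute(
    "SELECT a, b, c, d FROM t WHERE a IN ('attribute1','attribute2')"
)
# Column names come from the cursor description (DB-API standard).
columns = [desc[0] for desc in cur.description]

frames = []
while True:
    batch = cur.fetchmany(3)  # batch size would be much larger in practice
    if not batch:
        break
    frames.append(pd.DataFrame(batch, columns=columns))

df = pd.concat(frames, ignore_index=True)
print(len(df), df.duplicated().sum())
```

Because the result set comes from a single query execution, every row is delivered exactly once, regardless of row order.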

  2. You could use DISTINCT to avoid duplicates, and to avoid the timeout you could filter the table on a date range, if it has a column storing dates.
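A sketch of this suggestion, with sqlite3 standing in for Redshift. The created_at column is hypothetical; substitute whatever date column the real table has. DISTINCT collapses duplicate rows, and the BETWEEN filter shrinks how much each query scans:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
# created_at is a hypothetical date column, not from the original table.
conn.execute("CREATE TABLE t (a TEXT, b INTEGER, created_at TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("attribute1", 1, "2024-01-05"),
        ("attribute1", 1, "2024-01-05"),  # duplicate row
        ("attribute2", 2, "2024-02-10"),
        ("attribute1", 3, "2024-03-15"),  # outside the range below
    ],
)

df = pd.read_sql(
    """
    SELECT DISTINCT a, b, created_at
    FROM t
    WHERE a IN ('attribute1','attribute2')
      AND created_at BETWEEN '2024-01-01' AND '2024-02-28'
    """,
    conn,
)
print(len(df))
```

Note that DISTINCT only helps if the duplicates are exact copies; it will not fix rows repeated by unstable OFFSET paging if any column differs between runs.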
