
i have to select a set of rows (like 200 unique rows) from 200 million rows at once without order by and it must be efficient.

Answers


  1. As you are experiencing, sorting 200M rows can take a while, and if all you want is 200 rows then this is an expense you shouldn't need to pay. However, you do need to sort on a random value if you want those 200 rows to be random. Otherwise the sort order of the base tables and the order in which the Redshift slices reply will meaningfully skew your sample.

    You can get around this by first sampling down (through a random process) to a much more manageable number of rows, then sorting by the random value and picking your final 200. This still requires a sort, but over a significantly smaller set, which speeds things up considerably.

    select a, b
    from (
      select a, b, random() as ranno
      from test_table
    ) t  -- Redshift requires an alias on the derived table
    where ranno < .005
    order by ranno
    limit 200;
    

    You start with 200M rows and keep roughly 0.5% of them in the WHERE clause. Then you order only the surviving rows (about 1 million) before selecting 200. This is a much smaller sort and maintains the randomness of the selection.
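    As a quick sanity check on the threshold (plain arithmetic, not Redshift): each row passes `random() < threshold` independently, so the expected size of the intermediate sort is `threshold * total_rows`. The figures below assume the 200M-row table from the question.

    ```python
    # Expected pool size for a given threshold: threshold * total_rows.
    total_rows = 200_000_000

    # Rows expected to survive `where ranno < .005` -- about 1 million,
    # still far cheaper to sort than the full 200M.
    pool = int(0.005 * total_rows)
    print(pool)

    # To sort only ~10,000 rows instead, solve threshold = target / total_rows.
    target_pool = 10_000
    threshold = target_pool / total_rows
    print(threshold)
    ```

    You can tune the cutoff this way for any table size: pick the intermediate pool you are willing to sort, divide by the row count, and use that as the threshold (with some headroom, since the pool size varies randomly around the expectation).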

  2. Sampling down your data to a reasonable percentage, such as 10%, 5%, or 1%, should bring your volume to a manageable size. Then you can order by the sample value and limit to the number of rows you need.

    select *
    from (select *, random() as sample
          from "table") t
    where sample < .01
    order by sample
    limit 200;
    

    The following is an expansion on the question that others might find helpful as well. In my case, I had a huge table that I could split by a key field value into smaller subsets, but even after splitting, each subset was still very large (tens of millions of rows) and I still needed to sample it. I was initially concerned that the sampling wouldn't work on a subset created with a WITH clause, but that turned out not to be the case: comparing the distribution across all meaningful keys between the full subset (20 million rows) and the sample (30K rows) gave almost exactly the same distribution, which worked great. Sample code below:

    with subset as (
      select * from "table" where Key_field = 'XYZ'
    )
    select *
    from (select *, random() as sample
          from subset) s
    where s.sample < .01
    order by s.sample
    limit 200;
    
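    The distribution-preservation this answer observed is easy to sanity-check outside Redshift. A minimal Python simulation (hypothetical keys and row counts, not the author's data) applies the same per-row `random() < threshold` filter and compares key proportions before and after:

    ```python
    import random
    from collections import Counter

    random.seed(42)

    # Hypothetical data: 200,000 rows spread unevenly across three key values
    # (50% "A", 30% "B", 20% "C").
    rows = ["A"] * 100_000 + ["B"] * 60_000 + ["C"] * 40_000

    # Emulate `random() < .01`: each row passes independently with probability 1%.
    sample = [k for k in rows if random.random() < 0.01]

    # The sampled proportions land very close to the full-data proportions.
    samp = Counter(sample)
    for key in "ABC":
        print(key, round(samp[key] / len(sample), 3))
    ```

    Because every row is filtered independently of its key, the sample's key distribution is an unbiased estimate of the full table's, which is why the WITH-clause subset sampled just as cleanly as a base table would.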