
i have to select a set of rows (like 200 unique rows) from 200 million rows at once without order by and it must be efficient.

Answers


  1. As you are experiencing, sorting 200M rows can take a while, and if all you want is 200 rows then this is an expense you shouldn't need to pay. However, you do need to sort on a random value if you want those 200 rows to be random. Otherwise the sort order of the base tables and the order in which the Redshift slices reply will meaningfully skew your sample.

    You can get around this by first sampling down (through a random process) to a much more manageable number of rows, then sorting by the random value and picking your final 200. This still requires a sort, but over a significantly smaller set, which speeds things up considerably.

    select a, b
    from (
      select a, b, random() as ranno
      from test_table
    ) t  -- Redshift requires an alias on the derived table
    where ranno < .005
    order by ranno
    limit 200;
    

    You start with 200M rows and keep roughly 0.5% of them in the WHERE clause. Then you order only the surviving rows (about 1 million) before selecting 200. This is a much smaller sort and maintains the randomness of the selection.
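    As a quick sanity check on the threshold (plain arithmetic, not Redshift): each row passes `random() < threshold` independently, so the expected size of the intermediate sort is `threshold * total_rows`. The figures below assume the 200M-row table from the question.

    ```python
    # Expected pool size for a given threshold: threshold * total_rows.
    total_rows = 200_000_000

    # Rows expected to survive `where ranno < .005` -- about 1 million,
    # still far cheaper to sort than the full 200M.
    pool = int(0.005 * total_rows)
    print(pool)

    # To sort only ~10,000 rows instead, solve threshold = target / total_rows.
    target_pool = 10_000
    threshold = target_pool / total_rows
    print(threshold)
    ```

    You can tune the cutoff this way for any table size: pick the intermediate pool you are willing to sort, divide by the row count, and use that as the threshold (with some headroom, since the pool size varies randomly around the expectation).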

  2. Sampling down your data to a reasonable percentage, such as 10%, 5%, or 1%, should bring your volume to a manageable size. Then you can order by the sample value and limit to the number of rows you need.

    select *
    from (select *, random() as sample
          from "table") t
    where sample < .01
    order by sample
    limit 200;
    

    The following is an expansion on the question that others might find helpful as well. In my case, I had a huge table that I could split by a key field value into smaller subsets, but even after splitting, each subset was still very large (tens of millions of rows) and I still needed to sample it. I was initially concerned that the sampling wouldn't work on a subset created with a WITH clause, but that turned out not to be the case: comparing the distribution across all meaningful keys between the full subset (20 million rows) and the sample (30K rows) gave almost exactly the same distribution, which worked great. Sample code below:

    with subset as (
      select * from "table" where Key_field = 'XYZ'
    )
    select *
    from (select *, random() as sample
          from subset) s
    where s.sample < .01
    order by s.sample
    limit 200;
    
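    The distribution-preservation this answer observed is easy to sanity-check outside Redshift. A minimal Python simulation (hypothetical keys and row counts, not the author's data) applies the same per-row `random() < threshold` filter and compares key proportions before and after:

    ```python
    import random
    from collections import Counter

    random.seed(42)

    # Hypothetical data: 200,000 rows spread unevenly across three key values
    # (50% "A", 30% "B", 20% "C").
    rows = ["A"] * 100_000 + ["B"] * 60_000 + ["C"] * 40_000

    # Emulate `random() < .01`: each row passes independently with probability 1%.
    sample = [k for k in rows if random.random() < 0.01]

    # The sampled proportions land very close to the full-data proportions.
    samp = Counter(sample)
    for key in "ABC":
        print(key, round(samp[key] / len(sample), 3))
    ```

    Because every row is filtered independently of its key, the sample's key distribution is an unbiased estimate of the full table's, which is why the WITH-clause subset sampled just as cleanly as a base table would.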